(57) Abstract: This invention pertains to methods for the analysis of biological data, particularly spectra, for example, nuclear
magnetic resonance (NMR) and other types of spectra. More specifically, the present invention pertains to a method for processing a
sample spectrum comprising: replacing each of one or more target regions in said sample spectrum with a corresponding replacement
region of a master control spectrum to give a target-replaced sample spectrum, wherein said replacement region has been scaled so
as to have the same fraction of the total integrated intensity in said target-replaced sample spectrum as it did in said master control
spectrum. Possible applications: methods of identifying a biomarker or biomarker combination for an applied stimulus; classification
of an applied stimulus; diagnosis of an applied stimulus; therapeutic monitoring of a subject undergoing therapy; evaluating drug
therapy and/or drug efficacy; detecting toxic side-effects of drug; characterising and/or identifying a drug in overdose.

METHODS FOR SPECTRAL ANALYSIS AND THEIR APPLICATIONS:
SPECTRAL REPLACEMENT

TECHNICAL FIELD 

This invention pertains generally to the field of chemometrics, metabonomics, and,
more particularly, to methods for the analysis of chemical, biochemical, and
biological data, for example, spectra, for example, nuclear magnetic resonance
(NMR) and other types of spectra.

BACKGROUND 

Significant progress has been made in developing methods to determine and
quantify the biochemical processes occurring in living systems. Such methods are
valuable in the diagnosis, prognosis and treatment of disease, the development of
drugs, as well as for improving therapeutic regimes for current drugs.

Diseases of the human or animal body (such as cancers, degenerative diseases,
autoimmune diseases and the like) have an underlying basis in alterations in the
expression of certain genes. The expressed gene products, proteins, mediate
effects such as abnormal cell growth, cell death or inflammation. Some of these
effects are caused directly by protein-protein interactions; others are caused by
proteins acting on small molecules (e.g. "second messengers") which trigger effects
including further gene expression.

Likewise, disease states caused by external agents such as viruses and bacteria
provoke a multitude of complex responses in the infected host.

In a similar manner, the treatment of disease through the administration of drugs
can result in a wide range of desired effects and unwanted side effects in a patient.



At the genetic level, methods for examining gene expression in response to these
types of events are often referred to as "genomic methods," and are concerned
with the detection and quantification of the expression of an organism's genes,
collectively referred to as its "genome," usually by detecting and/or quantifying
genetic molecules, such as DNA and RNA. Genomic studies often exploit a new
generation of proprietary "gene chips," which are small disposable devices encoded
with an array of genes that respond to extracted mRNAs produced by cells (see, for
example, Klenk et al., 1997). Many genes can be placed on a chip array and
patterns of gene expression, or changes therein, can be monitored rapidly,
although at some considerable cost.



However, the biological consequences of gene expression, or altered gene
expression following perturbation, are extremely complex. This has led to the
development of "proteomic methods" which are concerned with the
semi-quantitative measurement of the production of cellular proteins of an
organism, collectively referred to as its "proteome" (see, for example, Geisow,
1998). Proteomic measurements utilise a variety of technologies, but all involve a
protein separation method, e.g., 2D gel-electrophoresis, allied to a chemical
characterisation method, usually some form of mass spectrometry.

In recent years, it has been appreciated that the reaction of human and animal
subjects to diseases and to treatments for them can vary according to the genomic
makeup of an individual. This has led to the development of the field of
"pharmacogenomics." A fuller understanding of how an individual's own genome
reacts to a particular disease will allow the development of new therapies, as well
as the refinement of existing ones.

At present, genomic and proteomic methods, which are both expensive and labour
intensive, have the potential to be powerful tools for studying biological response.
The choice of method is still uncertain since careful studies have sometimes shown
a low correlation between the pattern of gene expression and the pattern of protein
expression, probably due to sampling for the two technologies at inappropriate time
points (see, e.g., Gygi et al., 1999). Even in combination, genomic and proteomic
methods still do not provide the range of information needed for understanding



WO 02/052293, PCT/GB01/05685

integrated cellular function in a living system, since they do not take account of the
dynamic metabolic status of the whole organism.



For example, genomic and proteomic studies may implicate a particular gene or
protein in a disease or a xenobiotic response because the level of expression is
altered, but the change in gene or protein level may be transitory or may be
counteracted downstream and as a result there may be no effect at the cellular
and/or biochemical level. Conversely, sampling tissue for genomic and proteomic
studies at inappropriate time points may result in a relevant gene or protein being
overlooked.



Nonetheless, recent advances in genomics and proteomics now permit the rapid
identification of new potential targets for drug development. With a new target in
hand, and with the aid of combinatorial chemistry and high throughput screening,
the pharmaceutical industry is capable of rapidly generating and screening
thousands of new candidate compounds each week.



However, in practice, only a few of these candidate compounds will be taken
further, for example, into pre-clinical and clinical development. It is therefore critical
to identify those candidate compounds with the most promise, and this is usually
judged by efficacy and toxicology, before selection for clinical studies. However,
these selection processes are imperfect and many drugs fail in clinical trials due to
lack of efficacy and/or toxicological effects. It is also possible that other drugs may
fail overall because they are only effective in a subgroup of patients who have an
unrecognised pharmacogenomic response. There is a great need to find new ways
of reducing this compound "attrition" or losses of drugs late in the development
process, for example, through the development and application of analytical
technologies designed to maximise efficiency of compound selection and to
minimise attrition rates.

While genomic and proteomic methods may be useful aids in compound selection,
they do suffer from substantial limitations. For example, while genomic and
proteomic methods may ultimately give profound insights into toxicological
mechanisms and provide new surrogate biomarkers of disease, at present it is very
difficult to relate genomic and proteomic findings to classical cellular or biochemical
indices or endpoints. One simple reason for this is that with current technology and
approach, the correlation of the time-response to drug exposure is difficult. Further
difficulties arise with in vitro cell-based studies. These difficulties are particularly
important for the many known cases where the metabolism of the compound is a
prerequisite for a toxic effect and especially true where the target organ is not the
site of primary metabolism. This is particularly true for pro-drugs, where some
aspect of in situ chemical (e.g., enzymatic) modification is required for activity.

A new "metabonomic" approach has been proposed which is aimed at augmenting
and complementing the information provided by genomics and proteomics.
"Metabonomics" is conventionally defined as "the quantitative measurement of the
multiparametric metabolic response of living systems to pathophysiological stimuli
or genetic modification" (see, for example, Nicholson et al., 1999). This concept
has arisen primarily from the application of NMR spectroscopy to study the
metabolic composition of biofluids, cells, and tissues and from studies utilising
pattern recognition (PR), expert systems and other chemoinformatic tools to
interpret and classify complex NMR-generated metabolic data sets. Metabonomic
methods have the potential, ultimately, to determine the entire dynamic metabolic
make-up of an organism.

A pathological condition or a xenobiotic may act at the pharmacological level only
and hence may not affect gene regulation or expression directly. Alternatively,
significant disease or toxicological effects may be completely unrelated to gene
switching. For example, exposure to ethanol in vivo may switch on many genes but
none of these gene expression events explains drunkenness. In cases such as
these, genomic and proteomic methods are likely to be ineffective. However, all
disease or drug-induced pathophysiological perturbations result in disturbances in
the ratios and concentrations, binding or fluxes of endogenous biochemicals, either
by direct chemical reaction or by binding to key enzymes or nucleic acids that
control metabolism. If these disturbances are of sufficient magnitude, effects will
result which will affect the efficient functioning of the whole organism. In body
fluids, metabolites are in dynamic equilibrium with those inside cells and tissues
and, consequently, abnormal cellular processes in tissues of the whole organism
following a toxic insult or as a consequence of disease will be reflected in altered
biofluid compositions.

Fluids secreted, excreted, or otherwise derived from an organism ("biofluids")
provide a unique window into its biochemical status since the composition of a
given biofluid is a consequence of the function of the cells that are intimately
concerned with the fluid's manufacture and secretion. For example, the
composition of a particular fluid can carry biochemical information on details of
organ function (or dysfunction), for example, as a result of xenobiotics, disease,
and/or genetic modification. Similarly, the composition and condition of an
organism's tissues are also indicators of the organism's biochemical status.
Examples of biofluids include urine, blood plasma, milk, etc.

Biofluids often exhibit very subtle changes in metabolite profile in response to
external stimuli. This is because the body's cellular systems attempt to maintain
homeostasis (constancy of internal environment), for example, in the face of
cytotoxic challenge. One means of achieving this is to modulate the composition of
biofluids. Hence, even when cellular homeostasis is maintained, subtle responses
to disease or toxicity are expressed in altered biofluid composition. However,
dietary, diurnal and hormonal variations may also influence biofluid compositions,
and it is clearly important to differentiate these effects if correct biochemical
inferences are to be drawn from their analysis.

One of the most successful approaches to biofluid analysis has been the use of
NMR spectroscopy (see, for example, Nicholson et al., 1989); similarly, intact
tissues have been successfully analysed using magic-angle-spinning ¹H NMR
spectroscopy (see, for example, Moka et al., 1998; Tomlins et al., 1998).



The NMR spectrum of a biofluid provides a metabolic fingerprint or profile of the
organism from which the biofluid was obtained, and this metabolic fingerprint or
profile is characteristically changed by a disease, toxic process, or genetic
modification. For example, NMR spectra may be collected for various states of an
organism, e.g., pre-dose and various times post-dose, for one or more xenobiotics,
separately or in combination; healthy (control) and diseased animal; unmodified
(control) and genetically modified animal.

For example, in the evaluation of undesired toxic side-effects of drugs, each
compound or class of compound produces characteristic changes in the
concentrations and patterns of endogenous metabolites in biofluids that provide
information on the sites and basic mechanisms of the toxic process. NMR
analysis of biofluids has successfully uncovered novel metabolic markers of
organ-specific toxicity in the laboratory rat, and it is in this "exploratory" role that
NMR as an analytical biochemistry technique excels. However, the biomarker
information in NMR spectra of biofluids is very subtle, as hundreds of compounds
representing many pathways can often be measured simultaneously, and it is this
overall metabonomic response to toxic insult that so well characterises the lesion.

All biological fluids and tissues have their own characteristic physico-chemical
properties, and these affect the types of NMR experiment that may be usefully
employed. One major advantage of using NMR spectroscopy to study complex
biomixtures is that measurements can often be made with minimal sample
preparation (usually with only the addition of 5-10% D2O) and a detailed analytical
profile can be obtained on the whole biological sample. Sample volumes are small,
typically 0.3 to 0.5 mL for standard probes, and as low as 3 µL for microprobes.
Acquisition of simple NMR spectra is rapid and efficient using flow-injection
technology. It is usually necessary to suppress the water NMR resonance.

Many biofluids are not chemically stable and for this reason care should be taken in
their collection and storage. For example, cell lysis in erythrocytes can easily
occur. If a substantial amount of D2O has been added, then it is possible that
certain NMR resonances will be lost by H/D exchange. Freeze-drying of biofluid
samples also causes the loss of volatile components such as acetone. Biofluids
are also very prone to microbiological contamination, especially fluids, such as
urine, which are difficult to collect under sterile conditions. Many biofluids contain




significant amounts of active enzymes, either normally or due to a disease state or
organ damage, and these enzymes may alter the composition of the biofluid
following sampling. Samples should be stored deep frozen to minimise the effects
of such contamination. Sodium azide is usually added to urine at the collection
point to act as an antimicrobial agent. Metal ions and/or chelating agents (e.g.,
EDTA) may be added to bind to endogenous metal ions (e.g., Ca²⁺, Mg²⁺ and Zn²⁺)
and chelating agents (e.g., free amino acids, especially glutamate, cysteine,
histidine and aspartate; citrate) to alter and/or enhance the NMR spectrum.

In all cases the analytical problem usually involves the detection of "trace" amounts
of analytes in a very complex matrix of potential interferences. It is, therefore,
critical to choose a suitable analytical technique for the particular class of analyte of
interest in the particular biomatrix, which could be a biofluid or a tissue. High
resolution NMR spectroscopy (in particular ¹H NMR) appears to be particularly
appropriate. The main advantages of using NMR spectroscopy in this area are
the speed of the method (with spectra being obtained in 5 to 10 minutes), the
requirement for minimal sample preparation, and the fact that it provides a
non-selective detector for all the abnormal metabolites in the biofluid regardless of
their structural type, providing only that they are present above the detection limit of
the NMR experiment and that they contain non-exchangeable hydrogen atoms.
The speed advantage is of crucial importance in this area of work, as the clinical
condition of a patient may require rapid diagnosis and can change very rapidly, so
that correspondingly rapid changes must be made to the therapy provided.

NMR studies of body fluids should ideally be performed at the highest magnetic
field available to obtain maximal dispersion and sensitivity, and most NMR
studies have been performed at 400 MHz or greater. With every new increase in
available spectrometer frequency the number of resonances that can be resolved in
a biofluid increases and, although this has the effect of solving some assignment
problems, it also poses new ones. Furthermore, there are still important problems
of spectral interpretation that arise due to compartmentation and binding of small
molecules in the organised macromolecular domains that exist in some biofluids
such as blood plasma and bile. All this complexity need not reduce the diagnostic
capabilities and potential of the technique, but demonstrates the problems of
biological variation and the influence of variation on diagnostic certainty.

The information content of biofluid spectra is very high and the complete
assignment of the NMR spectrum of most biofluids is usually not possible (even
using 900 MHz NMR spectroscopy, the highest frequency commercially available).
However, the assignment problems vary considerably between biofluid types.
Some fluids have near constant composition and concentrations and in these the
majority of the NMR signals have been assigned. In contrast, urine composition
can be very variable and there is enormous variation in the concentration range of
NMR-detectable metabolites; consequently, complete analysis is much more
difficult. Those metabolites present close to the limits of detection for
1-dimensional (1D) NMR spectroscopy (ca. 100 nM for many metabolites at 800
MHz) pose severe NMR spectral assignment problems. (In absolute terms, the
detection limit may be ca. 4 nmol, e.g., 1 µg of a 250 g/mol compound in a 0.5 mL
sample volume.) Even at the present level of technology in NMR, it is not yet
possible to detect many important biochemical substances, e.g. hormones, proteins
or nucleic acids, in body fluids because of problems with sensitivity, line widths,
dispersion and dynamic range, and this area of research will continue to be
technology-limited. In addition, the collection of NMR spectra of biofluids may be
complicated by the relative water intensity, sample viscosity, protein content, lipid
content, and low molecular weight peak overlap.

Usually, in order to assign ¹H NMR spectra, comparison is made with spectra of
authentic materials and/or by standard addition of an authentic reference standard
to the sample. Additional confirmation of assignments is usually sought from the
application of other NMR methods, including, for example, 2-dimensional (2D) NMR
methods, particularly COSY (correlation spectroscopy), TOCSY (total correlation
spectroscopy), inverse-detected heteronuclear correlation methods such as HMBC
(heteronuclear multiple bond correlation), HSQC (heteronuclear single quantum
coherence), and HMQC (heteronuclear multiple quantum coherence), 2D
J-resolved (JRES) methods, spin-echo methods, relaxation editing, diffusion editing
(including both 1D NMR and 2D NMR such as diffusion-edited TOCSY), and
multiple quantum filtering. Detailed ¹H NMR spectroscopic data for a wide range of
metabolites and biomolecules found in biofluids have been published (see, for
example, Lindon et al., 1999) and supplementary information is available in several
literature compilations of data (see, for example, Fan, 1996; Sze et al., 1994).

For example, the successful application of NMR spectroscopy of biofluids to
study a variety of metabolic diseases and toxic processes has now been well
established and many novel metabolic markers of organ-specific toxicity have been
discovered (see, for example, Nicholson et al., 1989; Lindon et al., 1999). For
example, the NMR spectrum of urine is identifiably altered in situations where damage
has occurred to the kidney or liver. It has been shown that specific and identifiable
changes can be observed which distinguish the organ that is the site of a toxic
lesion. Also it is possible to focus in on particular parts of an organ such as the
cortex of the kidney and even in favourable cases to very localised parts of the
cortex. Finally it is possible to deduce the biochemical mechanism of the xenobiotic
toxicity, based on a biochemical interpretation of the changes in the urine. A wide
range of toxins has now been investigated including mostly kidney toxins and liver
toxins, but also testicular toxins, mitochondrial toxins and muscle toxins.

However, a limiting factor in understanding the biochemical information from both
1D and 2D NMR spectra of tissues and biofluids is their complexity.
The most efficient way to investigate these complex multiparametric data is to employ
the 1D and 2D NMR metabonomic approach in combination with computer-based
"pattern recognition" (PR) methods and expert systems. These statistical tools are
similar to those currently being explored by workers in the fields of genomics and
proteomics.

Pattern recognition (PR) is a general term applied to methods of data analysis
which can be used to generate scientific hypotheses as well as to test hypotheses
by reducing mathematically the many parameters.



PR methods may be conveniently classified as "supervised" or "unsupervised."
Unsupervised methods are used to analyse data without reference to any other
independent knowledge, for example, without regard to the identity or nature of a
xenobiotic or its mode of action.

Examples of unsupervised pattern recognition methods include principal
component analysis (PCA), hierarchical cluster analysis (HCA), and non-linear
mapping (NLM).

One of the most useful and easily applied unsupervised PR techniques is principal
components analysis (PCA) (see, for example, Sharaf, 1986). Principal
components (PCs) are new variables created from linear combinations of the
starting variables with appropriate weighting coefficients. The properties of these
PCs are such that: (i) each PC is orthogonal to (uncorrelated with) all other PCs,
and (ii) the first PC contains the largest part of the variance of the data set
(information content) with subsequent PCs containing correspondingly smaller
amounts of variance.

A data matrix, X, made up of rows, where each row defines a sample, and columns,
where each column defines a particular spectral descriptor, can be regarded as
composed of a scores matrix, T, and a loadings matrix, L, such that $X = TL^{t}$, where
$t$ denotes the transpose. The covariance matrix, C, is calculated from the data
matrix, X. The eigenvalues and eigenvectors of the covariance matrix are
determined by diagonalisation. The coordinates in eigenvector plots (the principal
components, PCs) are denoted "scores" and comprise the scores matrix T. The
eigenvector coefficients are denoted "loadings" and comprise the loadings matrix L,
and give the contributions of the descriptors to the PCs.
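
The decomposition just described can be sketched in a few lines of NumPy. This is a minimal illustration only, assuming mean-centred data; the function name `pca_scores_loadings` and the random data matrix are hypothetical and do not come from the source.

```python
import numpy as np

def pca_scores_loadings(X, n_components=2):
    """Return (scores T, loadings L, eigenvalues) for a samples x descriptors matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # mean-centre each descriptor (column)
    C = np.cov(Xc, rowvar=False)            # covariance matrix of the descriptors
    eigvals, eigvecs = np.linalg.eigh(C)    # diagonalise the symmetric matrix C
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    L = eigvecs[:, order]                   # loadings: one column per PC
    T = Xc @ L                              # scores: sample coordinates on the PCs
    return T, L, eigvals[order]

# Hypothetical data: 6 samples x 4 spectral descriptors (an assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
T, L, variances = pca_scores_loadings(X, n_components=4)
```

When all components are retained, `T @ L.T` reproduces the mean-centred data, which corresponds to the relation between X, T and L given above; plotting the first two columns of `T` gives a scores plot.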

Thus a plot of the first two or three PC scores gives the "best" representation, in
terms of information content, of the data set in two or three dimensions,
respectively. A plot of the first two principal component scores, PC1 and PC2, is
often called a "scores plot", and provides the maximum information content of the
data in two dimensions. Such PC maps can be used to visualise inherent
clustering behaviour for drugs and toxins acting on each organ according to toxic
mechanism. Of course, the clustering information might be in lower PCs and these
have also to be examined.

In this simple metabonomic approach, a sample from an animal treated with a 
compound of unknown toxicity is compared with a database of NMR-generated 
metabolic data from control and toxin-treated animals. By observing its position on 
the PR map relative to samples of known effect, the unknown toxin can often be 
classified. However, toxicological data are often more complex, with time-related 
development of lesions and associated shifts in NMR-detected biochemistry. Also, 
it is more rigorous to compare effects of xenobiotics in the original n-dimensional 
NMR metabonomic space. 

Hierarchical Cluster Analysis, another unsupervised pattern recognition method,
permits the grouping of data points which are similar by virtue of being "near" to
one another in some multi-dimensional space whose coordinates are defined by the
NMR descriptors which may be, for example, the signal intensities for particular
assigned peaks in an NMR spectrum. A "similarity matrix," S, is constructed with
elements $s_{ij} = 1 - r_{ij}/r_{ij}^{\max}$, where $r_{ij}$ is the interpoint distance between points i and j
(e.g., Euclidean interpoint distance), and $r_{ij}^{\max}$ is the largest interpoint distance for
all points. The most distant pair of points will have $s_{ij}$ equal to 0, since $r_{ij}$ then
equals $r_{ij}^{\max}$. Conversely, the closest pair of points will have the largest $s_{ij}$,
approaching 1.
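
As an illustration of the similarity matrix defined above, the following sketch computes it from Euclidean interpoint distances. The function name `similarity_matrix` and the three example points are hypothetical, chosen only so the distances are easy to check by hand.

```python
import math

def similarity_matrix(points):
    """Similarity = 1 - distance / largest distance, using Euclidean distances."""
    n = len(points)
    r = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    r_max = max(max(row) for row in r)      # largest interpoint distance
    if r_max == 0:
        raise ValueError("all points coincide; similarity is undefined")
    return [[1.0 - r[i][j] / r_max for j in range(n)] for i in range(n)]

# Hypothetical NMR descriptors (e.g. two peak intensities per sample)
points = [(0.0, 0.0), (3.0, 4.0), (0.3, 0.4)]
S = similarity_matrix(points)
```

Here the interpoint distances are 5, 0.5 and 4.5, so the most distant pair (points 0 and 1) has similarity 0 and the closest pair (points 0 and 2) has the largest off-diagonal similarity.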

The similarity matrix is scanned for the closest pair of points. The pair of points is
reported with its separation distance, and then the two points are deleted and
replaced with a single combined point. The process is then repeated iteratively
until only one point remains. A number of different methods may be used to
determine how two clusters will be joined, including the nearest neighbour method
(also known as the single link method), the furthest neighbour method, and the centroid
method (including centroid link, incremental link, median link, group average link,
and flexible link variations).
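
The iterative merging procedure can be sketched as follows, here using the nearest neighbour (single link) rule only. This is a deliberately simple illustration rather than an efficient implementation; `single_link_merges` and the example points are hypothetical.

```python
import math

def single_link_merges(points):
    """Agglomerate points with the nearest neighbour (single link) rule.

    Returns a list of (cluster_a, cluster_b, distance) merge records,
    where each cluster is a tuple of original point indices.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: distance between the closest members
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))   # report the pair
        merged = clusters[a] + clusters[b]
        # delete the two clusters and replace them with the combined one
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 3.0)]
merges = single_link_merges(points)
```

The merge records, listed in order of increasing separation distance, are the information a dendrogram displays.
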

The reported connectivities are then plotted as a dendrogram (a tree-like chart
which allows visualisation of clustering), showing sample-sample connectivities
versus increasing separation distance (or equivalently, versus decreasing
similarity). The dendrogram has the property that the branch lengths are
proportional to the distances between the various clusters and hence the length of
the branches linking one sample to the next is a measure of their similarity. In this
way, similar data points may be identified algorithmically.

Non-linear mapping (NLM) involves calculation of the distances between all of the
points in the original multi-dimensional space. This is followed by construction of a
map of points in 2 or 3 dimensions where the sample points are placed in random
positions or at values determined by a prior principal components analysis. The
least squares criterion is used to move the sample points in the lower dimension
map to fit the inter-point distances in the lower dimension space to those in the
higher dimensional space. Non-linear mapping is therefore an approximation to the
true inter-point distances, but points close in the original multi-dimensional space
should also be close in 2 or 3 dimensional space (see, for example, Brown et al.,
1996; Farrant et al., 1992).
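
A rough sketch of this idea is given below, assuming random starting positions and plain gradient descent on the sum of squared distance differences. The step size, iteration count, function names and random data are all assumptions for illustration; published NLM variants differ in their weighting and optimisation details.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance between every pair of rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(D_high, Y):
    """Least squares criterion: squared map-vs-original distance differences."""
    d = pairwise_distances(Y)
    iu = np.triu_indices(len(Y), k=1)       # count each pair once
    return float(((d[iu] - D_high[iu]) ** 2).sum())

def nonlinear_map(X, n_dims=2, n_iter=500, step=0.01, seed=0):
    X = np.asarray(X, dtype=float)
    D = pairwise_distances(X)               # distances in the original space
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(len(X), n_dims))   # random starting positions
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(d, 1.0)            # avoid division by zero on the diagonal
        coeff = (d - D) / d
        np.fill_diagonal(coeff, 0.0)
        grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)  # gradient of the stress
        Y -= step * grad                    # move points to reduce the stress
    return Y

# Hypothetical data: 10 samples in a 5-dimensional descriptor space
X = np.random.default_rng(1).normal(size=(10, 5))
Y = nonlinear_map(X)
```

Starting from PCA scores instead of random positions, as the text also allows, would only change the initialisation of `Y`.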

Alternatively, and in order to develop automatic classification methods, it has
proved efficient to use a "supervised" approach to NMR data analysis. Here, a
"training set" of NMR metabonomic data is used to construct a statistical model that
predicts correctly the "class" of each sample. The model is then tested with
independent data (a "test set") to determine its robustness.
These models are sometimes termed "Expert Systems," but may be based
on a range of different mathematical procedures. Supervised methods can use a
data set with reduced dimensionality (for example, the first few principal
components), but typically use unreduced data, with full dimensionality. In all
cases the methods allow the quantitative description of the multivariate boundaries
that characterise and separate each class, for example, each class of xenobiotic in
terms of its metabolic effects. It is also possible to obtain confidence limits on any
predictions, for example, a level of probability to be placed on the goodness of fit
(see, for example, Sharaf, 1986). The robustness of the predictive models can also






be checked using cross-validation, by leaving out selected samples from the 
analysis. 
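
Cross-validation of this kind can be sketched as a leave-one-out loop. The classifier used here (nearest class centroid) is a stand-in chosen for brevity and is an assumption, not a method named in the source; the sample values and class labels are likewise hypothetical.

```python
import math

def nearest_centroid_predict(train_X, train_y, x):
    """Assign x to the class whose mean training vector is closest."""
    sums, counts = {}, {}
    for vec, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, v in enumerate(vec):
            acc[k] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}
    return min(centroids, key=lambda lab: math.dist(centroids[lab], x))

def leave_one_out_accuracy(X, y):
    """Fraction of samples classified correctly when each is left out in turn."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]         # model is built without sample i
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
            hits += 1
    return hits / len(X)

# Hypothetical two-descriptor data for two classes
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [3.0, 3.1], [2.9, 3.3], [3.2, 2.8]]
y = ["control", "control", "control", "treated", "treated", "treated"]
accuracy = leave_one_out_accuracy(X, y)
```

A model that only fits its own training samples well would show a marked drop in this left-out accuracy, which is what the check is meant to reveal.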

Expert systems may operate to generate a variety of useful outputs, for example,
(i) classification of the sample as "normal" or "abnormal" (this is a useful tool in the
control of spectrometer automation using sequential flow injection NMR
spectroscopy); (ii) classification of the target organ for toxicity and site of action
within the tissue where in certain cases, mechanism of toxic action may also be
classified; and, (iii) identification of the biomarkers of a pathological disease
condition or toxic effect for the particular compound under study. For example, a
sample can be classified as belonging to a single class of toxicity, to multiple
classes of toxicity (more than one target organ), or to no class. The latter case
would indicate deviation from normality (control) based on the training set model
but having a dissimilar metabolic effect to any toxicity class modelled in the training
set (unknown toxicity type). Under (ii), a system could also be generated to support
decisions in clinical medicine (e.g., for efficacy of drugs) rather than toxicity.

Examples of supervised pattern recognition methods include the following, which
are briefly described below: soft independent modelling of class analogy (SIMCA)
(see, for example, Wold, 1976); partial least squares analysis (PLS) (see, for
example, Wold, 1966; Joreskog, 1982; Frank, 1984); linear discriminant analysis
(LDA) (see, for example, Nillson, 1965); K-nearest neighbour analysis (KNN) (see,
for example, Brown et al., 1996); artificial neural networks (ANN) (see, for example,
Wasserman, 1989; Anker et al., 1992; Hare, 1994); probabilistic neural networks
(PNNs) (see, for example, Parzen, 1962; Bishop, 1995; Speckt, 1990; Broomhead
et al., 1988; Patterson, 1996); rule induction (RI) (see, for example, Quinlan, 1986);
and, Bayesian methods (see, for example, Bretthorst, 1990).

As the size of metabonomic databases increases together with improvements in
rapid throughput of NMR samples (> 300 samples per day per spectrometer is now
possible with the first generation of flow injection systems), more subtle expert
systems may be necessary, for example, using techniques such as "fuzzy logic"
which permit greater flexibility in decision boundaries.

Pattern recognition methods have been applied to the analysis of metabonomic
data, including, for example, complex NMR data, with some success (see, for
example, Anthony et al., 1994; Anthony et al., 1995; Beckwith-Hall et al., 1998;
Gartland et al., 1990a; Gartland et al., 1990b; Gartland et al., 1991; Holmes et al.,
1998a; Holmes et al., 1998b; Holmes et al., 1992; Holmes et al., 1994; Spraul et
al., 1994; Tranter et al., 1999).

Although the utility of the metabonomic approach is well established, there remains
a great need for improved methods of analysis. The metabolic variation is often
subtle, and powerful analysis methods are required for detection of particular
analytes, especially when the data (e.g., NMR spectra) are so complex.

One aim of the present invention is to provide data analysis methods for the 
1 5 detection of such metabolic variations, as part of a metabonomic approach. 

SUMMARY OF THE INVENTION 

One aspect of the present invention pertains to improved methods for the analysis
of chemical, biochemical, and biological data, for example, spectra, for example,
nuclear magnetic resonance (NMR) and other types of spectra.

One aspect of the invention pertains to a method for processing a sample spectrum
comprising:

replacing each of one or more target regions in said sample spectrum with a
corresponding replacement region of a master control spectrum to give a
target-replaced sample spectrum,

wherein said replacement region has been scaled so as to have the same
fraction of the total integrated intensity in said target-replaced sample spectrum as
it did in said master control spectrum.

One embodiment of the present invention pertains to a method for processing a 
sample spectrum for a test sample, said method comprising the steps of:

(a) identifying, in said sample spectrum, one or more target regions for
replacement; 

(b) providing a master control spectrum which comprises one replacement 
region corresponding to each of said target regions; and, 

(c) replacing each of said target regions with the corresponding replacement 
region to give a target-replaced sample spectrum, 

wherein said replacement region has been scaled so as to have the same 
fraction of the total integrated intensity in said target-replaced sample spectrum as 
it did in said master control spectrum. 

In one embodiment of the present invention, the method further comprises the 
subsequent step of: 

(d) normalising said target-replaced sample spectrum to give a normalised
target-replaced sample spectrum. 

One embodiment of the present invention pertains to a method for processing a 
sample NMR spectrum for a test sample, said method comprising the steps of: 

(a) identifying, in said sample NMR spectrum, one or more target regions for 
replacement, wherein each of said target regions is defined by a chemical shift 
range; 

(b) providing a master control NMR spectrum which comprises one 
replacement region corresponding to each of said target regions, wherein a target 
region and its corresponding replacement region are defined by the same chemical
shift range; and, 

(c) replacing each of said target regions with the corresponding replacement 
region to give a target-replaced sample NMR spectrum, 

wherein said replacement region has been scaled so as to have the same 
fraction of the total integrated intensity in said target-replaced sample NMR 
spectrum as it did in said master control NMR spectrum.

In one embodiment of the present invention, the method further comprises the
subsequent step of:

(d) normalising said target-replaced sample NMR spectrum to give a
normalised target-replaced sample NMR spectrum.

In one embodiment of the present invention, in said replacing step (c), each of said 
target regions is replaced with the corresponding replacement region to give a 
target-replaced sample spectrum, 

wherein said replacement region has been scaled by a factor, f, given by the
formula:

$$ f = \frac{I_Y - \sum_{k=1}^{n_t} I_{Y,T_k}}{I_{CM} - \sum_{k=1}^{n_t} I_{CM,R_k}} $$

wherein:

$I_Y$ is the total integrated intensity of the sample spectrum;

$I_{Y,T_k}$ is the integrated intensity of the kth target region;

$I_{CM}$ is the total integrated intensity of the master control spectrum;

$I_{CM,R_k}$ is the integrated intensity of the kth replacement region;

k ranges from 1 to $n_t$; and,

$n_t$ is the number of target regions.
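
The replacement and normalisation steps can be illustrated with a minimal Python sketch. It is not taken from the specification: the function name `replace_target_regions`, the use of NumPy arrays, and the boolean region masks are assumptions made for illustration. Integrated intensity is approximated as a plain sum over equally spaced data points, and the scale factor is taken as f = (I_Y - sum of target intensities) / (I_CM - sum of replacement intensities), which follows from requiring each scaled replacement region to keep its master-control fraction of the total.

```python
import numpy as np

def replace_target_regions(sample, master, target_masks, normalise=True):
    """Replace target regions of `sample` with scaled regions of `master`.

    sample, master : 1-D intensity arrays sampled on the same axis.
    target_masks   : list of boolean arrays, one per target region
                     (assumed non-overlapping).
    Returns the target-replaced spectrum and the scale factor f.
    """
    sample = np.asarray(sample, dtype=float)
    master = np.asarray(master, dtype=float)

    # Merge all target regions into one mask; the same f applies to each.
    combined = np.zeros(sample.shape, dtype=bool)
    for mask in target_masks:
        combined |= np.asarray(mask, dtype=bool)

    i_y = sample.sum()             # total intensity of the sample spectrum
    i_cm = master.sum()            # total intensity of the master control
    i_t = sample[combined].sum()   # intensity inside the target regions
    i_r = master[combined].sum()   # intensity inside the replacement regions

    f = (i_y - i_t) / (i_cm - i_r)

    replaced = sample.copy()
    replaced[combined] = f * master[combined]
    if normalise:
        # Optional step (d): rescale to unit total intensity.
        replaced = replaced / replaced.sum()
    return replaced, f
```

With a master control of `[1, 2, 3, 4]` and a sample of `[2, 4, 50, 8]` whose third point is the target region, the sketch gives f = 2 and a target-replaced spectrum of `[2, 4, 6, 8]`, in which the replaced point accounts for 30% of the total, as it does in the master control.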

Another aspect of the invention pertains to a sample spectrum which has been
processed by a method according to the present invention.

Another aspect of the invention pertains to a method for processing a plurality of
sample spectra, comprising processing each of said sample spectra by a method
according to the present invention.

Another aspect of the invention pertains to a method of analysis of an applied
stimulus, comprising the steps of:

(a) providing one or more sample spectra for each of one or more samples
from each of one or more organisms which have been subjected to said applied
stimulus;

(b) providing a master control spectrum derived from one or more control
spectra for each of one or more samples from each of one or more organisms
which have not been subjected to said applied stimulus;

(c) processing each of said sample spectra using a method according to the
present invention.

In one preferred embodiment, the applied stimulus is a xenobiotic. In one preferred
embodiment, the applied stimulus is a disease state. In one preferred embodiment, 
the applied stimulus is a genetic modification. 

Another aspect of the invention pertains to a method for identifying a biomarker or 
biomarker combination for an applied stimulus, comprising a method of analysis of
an applied stimulus as described herein. 

Another aspect of the invention pertains to a biomarker or biomarker combination 
identified by such a method. 

Another aspect of the invention pertains to a method of diagnosis of an applied 
stimulus employing a biomarker identified by such a method. 

Another aspect of the invention pertains to an assay, which employs a biomarker 
identified by a method as described herein.

Another aspect of the invention pertains to a method of classifying an applied 
stimulus, comprising a method of analysis of an applied stimulus as described 
herein. 

Another aspect of the invention pertains to a method of diagnosis of an applied 
stimulus, comprising a method of analysis of an applied stimulus as described 
herein.

Another aspect of the invention pertains to a method of therapeutic monitoring of a
subject undergoing therapy, comprising a method of analysis of an applied stimulus 
as described herein. 

Another aspect of the invention pertains to a method of evaluating drug therapy 
and/or drug efficacy, comprising a method of analysis of an applied stimulus as 
described herein. 

Another aspect of the invention pertains to a method of detecting toxic side-effects
of a drug, comprising a method of analysis of an applied stimulus as described
herein. 

Another aspect of the invention pertains to a method of characterising and/or 
identifying a drug in overdose, comprising a method of analysis of an applied
stimulus as described herein. 

In one preferred embodiment, the spectrum or spectra is an NMR spectrum or
NMR spectra.

Another aspect of the invention pertains to a computer system operatively 
configured to implement a method according to the present invention.

Another aspect of the invention pertains to computer code suitable for 
implementing a method according to the present invention.

Another aspect of the invention pertains to a data carrier which carries computer
code suitable for implementing a method according to the present invention on a
suitable computer system.

As will be appreciated by one of skill in the art, features and preferred embodiments
of one aspect of the invention will also pertain to other aspects of the invention.

Figure 1 is a graph showing the four base spectra, denoted A, B, C, and D, which
were used to generate the simulated data in the Examples.

Figure 2 is a graph showing the four animal factors, denoted AFa, AFb, AFc, and 
AFd, which were used to generate the simulated data in the Examples. 

Figure 3 is a graph showing the four time factors, denoted TFa, TFb, TFc, and TFd, 
which were used to generate the simulated data in the Examples.

Figure 4 is a graph showing spectra for animal number 6 (A6) at the five time points
(T1-T5), denoted (i), (ii), (iii), (iv), and (v), respectively, as well as the master control
spectrum.

Figure 5 is a graph showing, for animal number 6 at time point 2 (A6,T2), (i) the
original spectrum, before replacement; (ii) the spectrum after spectral replacement;
and (iii) spectrum (ii) after re-normalisation.

Figure 6 is a graph showing, for animal number 6 at time point 3 (A6,T3), (i) the
original spectrum, before replacement; (ii) the spectrum after spectral replacement;
and (iii) spectrum (ii) after re-normalisation.

Figure 7 is a graph showing a scores plot (principal component 1 versus principal 
component 2) following principal component analysis of the sample spectra,
wherein the spectral regions associated with the interfering signal were deleted 
from all spectra. 

Figure 8 is a graph showing a scores plot (principal component 1 versus principal 
component 2) following principal component analysis of the normalised
target-replaced spectra, wherein the replaced regions were treated as "missing data."

Figure 9 is a graph showing a scores plot (principal component 1 versus principal
component 2) following principal component analysis of the normalised
target-replaced spectra, wherein the replaced regions were not treated as "missing data."

DETAILED DESCRIPTION OF THE INVENTION

The present invention pertains generally to the field of chemometrics,
metabonomics, and, more particularly, to methods for the analysis of biological
data, particularly spectra.

Biological Data

The methods of the present invention are applicable to chemical, biochemical, and 
biological data, for example, spectra, and especially spectra generated using types 
of spectroscopy and spectrometry which are useful in chemical and biochemical
(i.e., molecular) studies. 

The methods described herein facilitate more powerful analysis of spectral data. 
For example, the methods of the present invention make possible the identification
of spectral changes associated with an event of interest from a spectral background
which is non-specific and/or irrelevant.

In the context of studies of organisms, the event of interest may be, for example, an 
applied stimulus. The term "applied stimulus," as used herein, pertains to a
stimulus under study which is applied to, or is present in, an organism(s) under
study, and is not applied to, and is absent in, a control organism(s). Examples of
applied stimuli include, but are not limited to, a xenobiotic, a disease state, and a
genetic modification. 

The term "xenobiotic," as used herein, pertains to a substance (e.g., compound,
composition) which is administered to an organism, or to which the organism is 
exposed. In general, xenobiotics are chemical, biochemical or biological molecules 
which are not normally present in that organism, or are normally present in that
organism, but not at the level obtained following administration. Examples of
xenobiotics include drugs, formulated medicines and their components, pesticides,
herbicides, substances present in foods (e.g. plant compounds administered to 
animals), and substances present in the environment. 

The term "disease state," as used herein, pertains to a deviation from the normal
healthy state of the organism. Examples of disease states include bacterial, viral,
and parasitic infections, cancer in all its forms, degenerative diseases (e.g., arthritis,
multiple sclerosis), trauma (e.g., as a result of injury), organ failure (including
diabetes), cardiovascular disease (e.g., atherosclerosis, thrombosis), and inherited
diseases caused by genetic composition (e.g., sickle-cell anaemia).

The term "genetic modification," as used herein, pertains to alteration of the genetic
composition of an organism. Examples of genetic modifications include the
incorporation of a gene or genes into an organism from another species, increasing
the number of copies of an existing gene or genes in an organism, removal of a
gene or genes from an organism, and rendering a gene or genes in an organism
non-functional.

Examples of the types of spectroscopy which give spectra suitable for the
application of the methods of the present invention include, but are not limited to,
the following: all regions of the electromagnetic spectrum, including, for example,
microwave spectroscopy; far infrared spectroscopy; infrared spectroscopy; Raman
and resonance Raman spectroscopy; visible spectroscopy; ultraviolet
spectroscopy; far ultraviolet (or vacuum ultraviolet) spectroscopy; x-ray
spectroscopy; optical rotatory dispersion, circular dichroism (e.g., ultraviolet, visible
and infrared); Mossbauer spectroscopy; atomic absorption and emission
spectroscopy; ultraviolet fluorescence and phosphorescence spectroscopy;
magnetic resonance, including nuclear magnetic resonance (NMR), electron
paramagnetic resonance (EPR), MRI (magnetic resonance imaging); and mass
spectrometry, including variations of ionisation methods, including electron impact,
chemical ionisation, thermospray, electrospray, matrix assisted laser desorption
ionisation (MALDI), inductively coupled plasma, and detection methods, including
sector detection, quadrupole detection, ion-trap, time-of-flight, and Fourier
transform.

One particularly preferred class of spectroscopy is nuclear magnetic resonance
(NMR). Examples of such methods include 1D, 2D, and 3D-NMR, including, for
example, 1D spectra, such as single pulse, water-peak saturated, spin-echo such
as CPMG (i.e., edited on the basis of nuclear spin relaxation times), and
diffusion-edited; 2D spectra, such as J-resolved (JRES), ¹H-¹H correlation methods such as
NOESY, COSY, TOCSY and variants thereof, and methods which correlate ¹H to
heteronuclei (including, for example, ¹³C, ¹⁵N, and ¹⁹F), such as direct
detection methods such as HETCOR and inverse-detected methods such as
¹H-¹³C HMQC, HSQC and HMBC; 3D spectra, including many variants, which are
combinations of 2D methods, e.g., HMQC-TOCSY, NOESY-TOCSY, etc. All of
these NMR spectroscopic techniques can also be combined with
magic-angle-spinning (MAS) in order to study samples other than isotropic liquids, such as
tissues or foodstuffs, which are characterised by anisotropic composition.

Composite spectra, which are formed from two or more spectra of different types,
may also be used. 

The methods of the present invention are applied to spectra obtained or recorded
for particular samples under study. Samples may be in any form which is
compatible with the particular type of spectroscopy, and therefore may be, as
appropriate, homogeneous or heterogeneous, comprising one or a combination of, 
a gas, a liquid, a liquid crystal, a gel, or a solid, and including samples with a 
biological origin. 

Examples of such samples include those originating from an organism, for 
example, a whole organism (living or dead, e.g., a living human, a culture of 
bacteria); a part or parts of an organism (e.g., a tissue sample, an organ, a leaf); a 
pathological tissue such as a tumour; a tissue homogenate (e.g. a liver microsome 
fraction); an extract prepared from an organism or a part of an organism (e.g., a
tissue sample extract, such as perchloric acid extract); an infusion prepared from an
organism or a part of an organism (e.g., tea, Chinese traditional herbal medicines);
an in vitro tissue such as a spheroid; a suspension of a particular cell type (e.g.,
hepatocytes); an excretion, secretion, or emission from an organism (especially a
fluid); material which is administered and collected (e.g., dialysis fluid, lung aspirate
fluid); material which develops as a function of pathology (e.g., a cyst, blisters); and
supernatant from a cell culture.

Examples of fluid samples include, for example, urine, (gall bladder) bile, blood 
plasma, whole blood, cerebrospinal fluid, milk, saliva, mucus, sweat, gastric juice, 
pancreatic juice, seminal fluid, prostatic fluid, seminal vesicle fluid, seminal plasma,
amniotic fluid, foetal fluid, follicular fluid, synovial fluid, aqueous humour, ascitic
fluid, cystic fluid, and blister fluid, plus cell suspensions and extracts thereof. 

Examples of tissue samples include liver, kidney, prostate, brain, gut, blood, 
skeletal muscle, heart muscle, lymphoid, bone, cartilage, and reproductive tissues.

Still other examples of samples include air (e.g., exhaust), air condensates or 
extracts, water (e.g., seawater, groundwater, wastewater, e.g., from factories), 
liquids from the food industry (e.g. juices, wines, beers, other alcoholic drinks, tea, 
milk), solid-like food samples (e.g., chocolate, pastes, fruit peel, fruit and vegetable
flesh such as banana, leaves, meats, whether cooked or raw, etc.). 

The sample may also be a concentrate of a fluid, for example, a concentrate of a 
fluid described above. 

For samples which are, or are drawn from, an organism, the organism, in general, 
may be a prokaryote (e.g., bacteria) or a eukaryote (e.g., protoctista, fungi, plants, 
animals). 

The organism may be an alga or a protozoan.

The organism may be a plant, an angiosperm, a dicotyledon, a monocotyledon, a
gymnosperm, a conifer, a ginkgo, a cycad, a fern, a horsetail, a clubmoss, a
liverwort, or a moss.

The organism may be a chordate, an invertebrate, an echinoderm (e.g., starfish,
sea urchins, brittle stars), an arthropod, an annelid (segmented worms) (e.g.,
earthworms, lugworms, leeches), a mollusk (cephalopods (e.g., squids, octopi),
pelecypods (e.g., oysters, mussels, clams), gastropods (e.g., snails, slugs)), a
nematode (round worms), a platyhelminthes (flatworms) (e.g., planarians, flukes,
tapeworms), a cnidaria (e.g., jelly fish, sea anemones, corals), or a porifera (e.g.,
sponges).

The organism may be an arthropod, an insect (e.g., beetles, butterflies, moths), a
chilopoda (centipedes), a diplopoda (millipedes), a crustacean (e.g., shrimps,
crabs, lobsters), or an arachnid (e.g., spiders, scorpions, mites). 

The organism may be a chordate, a vertebrate, a mammal, a bird, a reptile (e.g., 
snakes, lizards, crocodiles), an amphibian (e.g., frogs, toads), a bony fish (e.g., 
salmon, plaice, eel, lungfish), a cartilaginous fish (e.g., sharks, rays), or a jawless
fish (e.g., lampreys, hagfish).

The organism may be a mammal, a placental mammal, a marsupial (e.g., 
kangaroo, wombat), a monotreme (e.g., duckbilled platypus), a rodent (e.g., a 
guinea pig, a hamster, a rat, a mouse), murine (e.g., a mouse), avian (e.g., a bird),
canine (e.g., a dog), feline (e.g., a cat), equine (e.g., a horse), porcine (e.g., a pig),
ovine (e.g., a sheep), bovine (e.g., a cow), a primate, simian (e.g., a monkey or
ape), a monkey (e.g., marmoset, baboon), an ape (e.g., gorilla, chimpanzee,
orangutan, gibbon), or a human.

Furthermore, the organism may be in any of its forms, for example, a spore, a seed,
an egg, a larva, a pupa, or a foetus.

Spectral Replacement

Spectra often have features (e.g., peaks, noise spikes, baseline artefacts, etc.)
which interfere with and/or reduce the power and/or accuracy of subsequent
analysis. Some of these features are artefacts of the particular type of spectrum, its
method of acquisition, adventitious impurities, and the like. However, more often
these spectral features arise from chemical species which are not accidentally or
unintentionally present in the sample under study. In order to improve the power and
efficiency of subsequent spectral analysis, it is useful to identify and treat appropriately
those parts of the spectra which are associated with such species. In addition, spectral
features introduced unintentionally need to be identified and treated appropriately.

For example, in metabonomic studies, a sample from an organism under study may 
show spectral evidence of a large number of metabolites, some of which provide 
little or no useful information about the applied stimulus, yet interfere with
subsequent data analysis. For example, spectral peaks from drugs and their
metabolites often dominate the metabonomic description of the dosed organism,
but their identification and levels are sometimes of secondary importance.

In general, metabolites may be placed in one of three classes:

(A) Endogenous metabolites, the levels of which are significantly altered by the
application of the applied stimulus. A single metabolite of this type is typically
referred to as a biomarker. In a more complex case, where the levels of several, or
more, metabolites are changed (whether increased or decreased), the group of
metabolites is typically referred to as a biomarker combination. For example, an
increase in taurine together with creatine levels in urine is a general marker for liver
damage. In a more complex example, toxins which cause lesions in the S3 portion
of the renal proximal tubule cause elevations of urinary glucose, amino acids and
organic acids with decreases in tricarboxylic acid cycle intermediates.

(B) Endogenous metabolites, the levels of which are unaffected by application of
the applied stimulus.

(C) Metabolites which appear in the sample and which arise from a xenobiotic itself
or its metabolites. For example, paracetamol is seen in urine mainly as
paracetamol sulfate and paracetamol glucuronide conjugates. In some cases
unchanged paracetamol can also be seen. Of course, these metabolites will be
present only if the applied stimulus includes a xenobiotic.

Metabolites falling in class B, and many of those metabolites falling in class C (i.e.,
not biomarkers or biomarker combinations), collectively referred to herein as
"interfering signals", often provide little information about the organism's response
to an applied stimulus, while dominating and interfering with the metabonomic
description of the stimulated organism.

Whether or not a particular metabolite is, or is a candidate as, an interfering signal
can often be determined from known data regarding the applied stimulus under
study. For example, there may be a large body of public knowledge regarding the
metabolism of a particular compound, or of compounds having a particular
substructure. Often, an interfering signal, and its associated spectral features, can
be readily identified by eye by the skilled artisan. However, if new spectral features
are observed which are not readily identified, the associated compounds giving rise
to these features can be isolated and characterised using known methods, for
example, by coupling liquid chromatography with NMR or mass spectrometry.

In some methods, those parts of the spectrum associated with these interfering
signals are excised. However, when comparing or combining data from several
studies (e.g., using different xenobiotics, different disease states, etc.), these parts
of the spectrum are effectively deleted from all spectra in a combined data set. The
deleted regions can encompass a large fraction of the total spectral region,
significantly reducing the information content of the combined set of spectra, and
thereby reducing the power and efficiency of subsequently applied pattern
recognition methods.

In some known methods, the excised parts of the spectrum are "filled," for example,
by replacing the excised spectral data with, for example, zero intensity values
("zero fill"); with an arbitrary or predetermined constant intensity value ("constant
fill"); a random intensity value ("random fill"); a mean intensity value ("mean fill")
calculated from the entire dataset; or an intensity value based on a principal
component analysis ("principal component fill").
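
For comparison, three of the simple fill strategies can be sketched as follows. This is an illustrative sketch only: the function name `fill_region`, the array layout (one spectrum per row), and the reading of "mean fill" as the per-point mean over all spectra in the data set are assumptions, and the random and principal component fills are omitted.

```python
import numpy as np

def fill_region(spectra, mask, method="zero", constant=0.0):
    """Return a copy of `spectra` with the excised region overwritten.

    spectra : array-like, shape (n_spectra, n_points), one spectrum per row.
    mask    : boolean array of length n_points marking the excised region.
    method  : "zero", "constant", or "mean".
    """
    filled = np.array(spectra, dtype=float)   # np.array makes a copy
    mask = np.asarray(mask, dtype=bool)
    if method == "zero":
        filled[:, mask] = 0.0
    elif method == "constant":
        filled[:, mask] = constant
    elif method == "mean":
        # Assumption: each excised point takes its mean over the data set.
        filled[:, mask] = filled[:, mask].mean(axis=0)
    else:
        raise ValueError(f"unknown fill method: {method!r}")
    return filled
```

None of these fills rescales the inserted values to the total intensity of the individual spectrum, which is the step that distinguishes spectral replacement.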

However, rather than simply deleting, or deleting and subsequently filling, these
spectral regions, it is desirable to employ a method of "spectral replacement" in
which these spectral regions are replaced with meaningful data, for example,
corresponding scaled spectral regions from normal or control spectra (e.g., in the
case of organism studies, spectra associated with normal or control organisms).
Subsequent normalisation may further improve the data content, by scaling the
peak intensities to values which, in a sense, they would have had if the interfering
features (e.g., peaks) had not been included.

Therefore, whether for the metabonomic reasons discussed above, or for other 
reasons, the spectrum is subjected to the additional step of "spectral replacement" 
as described herein. In general, spectral replacement is performed following 
acquisition of the spectrum (or spectra), including the normal pre-processing
associated with the particular type of spectrum (e.g., signal averaging, Fourier
transformation, baseline correction, etc.), but before subsequent analysis.

One aspect of the present invention pertains to a method for processing a sample
spectrum comprising replacing each of one or more target regions in said sample
spectrum with the corresponding replacement region of a master control spectrum
to give a target-replaced sample spectrum, wherein the replacement region has
been scaled so as to have the same fraction of the total integrated intensity in said
target-replaced sample spectrum as it did in said master control spectrum.

Thus, one embodiment of the present invention pertains to a method for processing
a sample spectrum for a test sample, said method comprising the steps of:

(a) identifying, in said sample spectrum, one or more target regions for
replacement; 

(b) providing a master control spectrum which comprises one replacement 
region corresponding to each of said target regions; and, 

(c) replacing each of said target regions with the corresponding replacement 
region to give a target-replaced sample spectrum, wherein said replacement region 
has been scaled so as to have the same fraction of the total integrated intensity in
the target-replaced sample spectrum as it did in the master control spectrum.

In a preferred embodiment, the methods further comprise a subsequent step of: 

(d) normalising said target-replaced sample spectrum to give a normalised 
target-replaced sample spectrum. 

Another embodiment of the present invention pertains to a method for processing a 
sample NMR spectrum for a test sample, said method comprising the steps of: 

(a) identifying, in said sample NMR spectrum, one or more target regions for 
replacement, wherein each of said target regions is defined by a chemical shift
range;

(b) providing a master control NMR spectrum which comprises one
replacement region corresponding to each of said target regions, wherein a target
region and its corresponding replacement region are defined by the same chemical 
shift range; and, 

(c) replacing each of said target regions with the corresponding replacement 
region to give a target-replaced sample NMR spectrum, wherein said replacement 
region has been scaled so as to have the same fraction of the total integrated 
intensity in said target-replaced sample NMR spectrum as it did in said master
control NMR spectrum. 

In a preferred embodiment, the methods further comprise a subsequent step of:

(d) normalising said target-replaced sample NMR spectrum to give a
normalised target-replaced sample NMR spectrum.

Note that, in each of the above methods, step (b) may be performed either before
or after step (a).

The term "sample spectrum," as used herein, pertains to a spectrum obtained
from a sample under study. If there are several sample spectra, as is typically the
case, each one is treated separately. 

The sample spectrum is, to one degree or another, representative of the
composition of the sample. In general, a sample can be generalised as an
n-dimensional object, where the coordinate along each of the axes or dimensions is
the concentration of an individual chemical or biochemical species. Equivalently, the
sample can be represented via its spectrum, also as an n-dimensional object, y,
where the coordinate along each of the axes or dimensions ($y_1, y_2, y_3, \ldots, y_n$) is the
spectral intensity (or equivalent parameter) at each data point. For example, for a
1D NMR spectrum, each of $y_1$, $y_2$, $y_3$, etc. may represent signal intensity at a different
chemical shift. It is not necessary to assign spectral features (e.g., peaks,
features, lines) at this stage, since the spectrum is treated solely as a statistical object.

A sample spectra set, Y, may be formed from $n_y$ sample spectra, each of which is
denoted $y_i$ (where i runs from 1 to $n_y$) and each of which has descriptors $y_{ij}$ (where j
ranges from 1 to the total number of descriptors). Each sample spectrum, i, has a
total integrated intensity, $I_{Y_i}$, given by:

$$ I_{Y_i} = \sum_{j} y_{ij} $$

As mentioned above, the target regions are one or more spectral regions in the 
sample spectrum which are to be replaced. Each of one or more target regions in 
the sample spectrum is replaced with the corresponding and appropriately scaled 
replacement region of a master control spectrum. The target regions for the ith
sample spectrum may be denoted $t_{i,k}$, where k ranges from 1 to $n_t$, and $n_t$ is the
number of target regions. In metabonomic studies, the target regions typically
pertain to, relate to, or otherwise reflect low correlation metabolites, as discussed
above.

The master control spectrum may be a single spectrum, referred to herein as a 
"control spectrum," or more preferably it is an average spectrum calculated from 
two or more control spectra. Where the spectra are associated with an organism, 
the master control spectrum may be a single spectrum from a control organism, 
referred to herein as a "control spectrum," or more preferably it is an average
spectrum calculated from two or more control spectra. The control spectra may be 
obtained from a single control organism, or, more preferably, from two or more 
control organisms. In the context of studies of organisms, the stimulus under study 
is not applied to, nor is present in, the control organism(s). 

For example, a control spectra set, C, may be formed from $n_c$ control spectra, each
of which is denoted $c_i$ and has descriptors $c_{ij}$, where j runs between 1 and the
number of descriptors. The master control spectrum, $c_M$, having descriptors $c_{M,j}$,
may be calculated as:

$$ c_{M,j} = \frac{1}{n_c} \sum_{i=1}^{n_c} c_{ij} $$

The master control spectrum has a total integrated intensity, $I_{CM}$, given by:

$$ I_{CM} = \sum_{j} c_{M,j} $$
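
A minimal Python sketch of this calculation is given below, assuming that the average is the arithmetic mean taken point by point and that integrated intensity is the plain sum over data points. The function name `master_control_spectrum` and the array layout (one control spectrum per row) are illustrative assumptions, not part of the specification.

```python
import numpy as np

def master_control_spectrum(control_spectra):
    """Average a set of control spectra point by point.

    control_spectra : array-like, shape (n_c, n_points), all spectra on a
                      common axis, one control spectrum per row.
    Returns the master control spectrum and its total integrated intensity.
    """
    C = np.asarray(control_spectra, dtype=float)
    if C.ndim != 2 or C.shape[0] < 1:
        raise ValueError("expected a 2-D array with at least one control spectrum")
    c_m = C.mean(axis=0)   # mean over the n_c control spectra at each point
    i_cm = c_m.sum()       # total integrated intensity of the master control
    return c_m, i_cm
```

With a single control spectrum the mean reduces to that spectrum, which matches the case where the master control spectrum is one control spectrum.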



The master control NMR spectrum comprises one replacement region 
corresponding to each of the target regions. The term "replacement region(s)," as 
used herein, pertains to that part/those parts of the master control spectrum which
correspond(s) to the target region(s) of the sample spectrum. For example, if the
spectrum is a 1D NMR spectrum, and a particular target region is defined as
δ 7.2-7.7 in the sample NMR spectrum, then the corresponding replacement region
is also defined as δ 7.2-7.7 in the master control NMR spectrum.
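
Because the target region and its replacement region share one chemical shift range, a single selection can serve both spectra when they are sampled on the same axis. The sketch below is illustrative: the function name `region_mask`, the inclusive end points, and the representation of the chemical shift axis as an array of ppm values are assumptions.

```python
import numpy as np

def region_mask(ppm, low, high):
    """Boolean mask selecting the data points with low <= ppm <= high.

    ppm       : 1-D array of chemical shifts, one per data point.
    low, high : limits of the region in ppm (inclusive, by assumption).
    """
    ppm = np.asarray(ppm, dtype=float)
    if low > high:
        low, high = high, low   # accept limits given in either order
    return (ppm >= low) & (ppm <= high)
```

For example, `region_mask(ppm, 7.2, 7.7)` selects the same data points in the sample spectrum and in the master control spectrum, provided both were recorded or resampled onto the same `ppm` axis.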

Each replacement region is scaled so that it represents the same fraction of the
total integrated intensity in the target-replaced sample spectrum as it did in the
master control spectrum. For example, if a replacement region represented 2% of 
the total intensity in the master control spectrum, then it must also account for 2% 
of the total intensity in the target-replaced sample spectrum. 

For example, consider the case where the sample spectrum, with integrated
intensity I_Y, has a single target region with integrated intensity I_T. The remainder of
the spectrum has an integrated intensity of I_Y - I_T. The master control spectrum has
an integrated intensity of I_CM, and the replacement region therein has an integrated
intensity of I_R. The fraction of the total integrated intensity in the master control
spectrum accounted for by the replacement region is I_R/I_CM. The replacement
region is scaled by a factor, f, and thus the scaled replacement region has an
integrated intensity of f·I_R. The target-replaced spectrum now has an integrated
intensity of I_Y - I_T + f·I_R.

The scale factor, f, is selected so that the scaled replacement region (intensity f·I_R) has
the same fraction of the total integrated intensity in the target-replaced sample
spectrum (f·I_R/[f·I_R + I_Y - I_T]) as it did in the master control spectrum (I_R/I_CM), that is,
f·I_R/[f·I_R + I_Y - I_T] = I_R/I_CM. Rearranging this equation gives:

$$f = \frac{I_Y - I_T}{I_{CM} - I_R}$$
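
The rearrangement may be checked numerically. In the following Python sketch, the function name scale_factor and the intensity values are purely illustrative assumptions; the sketch computes f for a single target region and confirms that the scaled replacement region then accounts for the same fraction of the target-replaced spectrum as it did of the master control spectrum.

```python
def scale_factor(i_y, i_t, i_cm, i_r):
    """Scale factor for a single target region.

    Starting from  f*I_R / (f*I_R + I_Y - I_T) = I_R / I_CM,
    cancelling I_R and cross-multiplying gives  f*I_CM = f*I_R + I_Y - I_T,
    hence  f = (I_Y - I_T) / (I_CM - I_R).
    """
    if i_cm == i_r:
        raise ValueError("replacement region spans the whole control spectrum")
    return (i_y - i_t) / (i_cm - i_r)


# Made-up intensities: the replacement region is 1/5 (20%) of the control total.
i_y, i_t, i_cm, i_r = 10.0, 4.0, 5.0, 1.0

f = scale_factor(i_y, i_t, i_cm, i_r)
fraction_in_control = i_r / i_cm
fraction_after_replacement = f * i_r / (i_y - i_t + f * i_r)
```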

Consider also the case where the sample spectrum, with integrated intensity I_Y, has
two target regions with integrated intensities I_T1 and I_T2, respectively. The remainder
of the spectrum has an integrated intensity of I_Y - I_T1 - I_T2. The master control
spectrum has an integrated intensity of I_CM, and the respective replacement regions
therein have integrated intensities of I_R1 and I_R2, respectively. The fraction of the
total integrated intensity in the master control spectrum accounted for by the first
and second replacement regions is I_R1/I_CM and I_R2/I_CM, respectively. The first
replacement region is scaled by a factor, f_1, and thus the scaled first replacement
region has an integrated intensity of f_1·I_R1. The second replacement region is scaled
by a factor, f_2, and thus the scaled second replacement region has an integrated
intensity of f_2·I_R2. The target-replaced spectrum now has an integrated intensity of
I_Y - I_T + f_1·I_R1 + f_2·I_R2, where I_T = I_T1 + I_T2.

The scale factors, f_1 and f_2, are selected so that each scaled replacement region
(intensities f_1·I_R1 and f_2·I_R2, respectively) has the same fraction of the total integrated
intensity in the target-replaced sample spectrum (f_1·I_R1/[I_Y - I_T + f_1·I_R1 + f_2·I_R2] and
f_2·I_R2/[I_Y - I_T + f_1·I_R1 + f_2·I_R2], respectively) as it did in the master control spectrum
(I_R1/I_CM and I_R2/I_CM, respectively). This gives two simultaneous equations:
f_1·I_R1/[I_Y - I_T + f_1·I_R1 + f_2·I_R2] = I_R1/I_CM and
f_2·I_R2/[I_Y - I_T + f_1·I_R1 + f_2·I_R2] = I_R2/I_CM, from which it can be shown that:

$$f_1 = f_2 = f = \frac{I_Y - I_{T1} - I_{T2}}{I_{CM} - I_{R1} - I_{R2}}$$

In the general case, the target regions for the ith sample spectrum (y_i) are denoted
t_i,k, the corresponding replacement regions are denoted r_k, and in both cases, k
ranges from 1 to n_t, where n_t is the number of target regions.

For the kth target region of the ith sample spectrum, denoted t_i,k, the integrated
intensity, I_Yi,Tk, is calculated as:

$$I_{Yi,Tk} = \sum_{j} y_{ij}$$

where the sum is over the descriptors, j, of that target region.

Similarly, for the replacement region of the master control spectrum, denoted r_k
(corresponding to the kth target region of the ith sample spectrum, t_i,k), the
integrated intensity is calculated as:

$$I_{CM,Rk} = \sum_{j} c_{Mj}$$

where the sum is over the descriptors, j, of that replacement region.

Thus, generalising the above examples, it may be shown that where there are
many target regions, the scale factor for the ith sample spectrum is given by:

$$f_i = \frac{I_{Yi} - \sum_{k} I_{Yi,Tk}}{I_{CM} - \sum_{k} I_{CM,Rk}}$$

where:

I_Yi is the total integrated intensity of the sample spectrum (before
replacement);

I_Yi,Tk is the integrated intensity of the target region in question;
I_CM is the total integrated intensity of the master control spectrum;
I_CM,Rk is the integrated intensity of the replacement region in question; and,
k ranges from 1 to n_t, the number of target regions.

Thus, prior to replacement of the target region, t_i,k, by its corresponding
replacement region, r_k, that replacement region is scaled by (i.e., multiplied by) a
factor, f_i, given above. In this way, for each sample spectrum and for each target
region therein, a target region, t_i,k, of integrated intensity I_Yi,Tk is replaced by a
replacement region, r_k, of integrated intensity f_i·I_CM,Rk.
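
A minimal Python sketch of the scaling and replacement steps for several target regions is given below. It assumes that the scale factor is the untargeted sample intensity divided by the unreplaced master control intensity, as obtained by generalising the one- and two-region examples above. The function name replace_targets, the (start, stop) index representation of regions, and the numerical values are hypothetical and serve only to illustrate the calculation.

```python
def replace_targets(sample, master, regions):
    """Return (target-replaced spectrum, scale factor f_i).

    sample, master: equal-length lists of descriptors.
    regions: non-overlapping (start, stop) index pairs, stop exclusive;
             the same indices define target and replacement regions.
    """
    i_y = sum(sample)                                  # I_Yi
    i_cm = sum(master)                                 # I_CM
    i_t = sum(sum(sample[a:b]) for a, b in regions)    # sum over k of I_Yi,Tk
    i_r = sum(sum(master[a:b]) for a, b in regions)    # sum over k of I_CM,Rk
    f = (i_y - i_t) / (i_cm - i_r)

    replaced = list(sample)
    for a, b in regions:
        # Overwrite each target region with the scaled replacement region.
        replaced[a:b] = [f * value for value in master[a:b]]
    return replaced, f


# Made-up sample with large interfering signals at indices 1 and 3.
sample = [1.0, 8.0, 2.0, 6.0, 3.0]
master = [1.0, 1.0, 1.0, 1.0, 1.0]
regions = [(1, 2), (3, 4)]

replaced, f = replace_targets(sample, master, regions)

# Each replacement region held 1/5 of the master control intensity and
# holds the same share of the target-replaced spectrum.
share_after = replaced[1] / sum(replaced)
```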
The fourth step of the method, which is optional, but which is preferred, involves
normalising the target-replaced sample spectrum to give a "normalised target-
replaced sample spectrum." Normalisation is typically achieved by scaling the
target-replaced sample spectrum to give unit total integrated intensity, that is, by
scaling by a factor of 1 divided by the total integrated intensity of the target-
replaced sample spectrum, and thus may be expressed by the following formula:

$$y_{ij}^{TR,N} = \frac{y_{ij}^{TR}}{\sum_{j'} y_{ij'}^{TR}}$$

wherein y_ij^TR denotes the descriptors of the target-replaced sample spectrum, and
y_ij^TR,N denotes the descriptors of the normalised target-replaced sample spectrum.
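
The normalisation step may be sketched in Python as follows. The function name normalise and the example values are assumptions made for illustration; the sketch simply divides every descriptor by the total integrated intensity, as described above.

```python
def normalise(spectrum):
    """Scale a spectrum so that its descriptors sum to one."""
    total = sum(spectrum)
    if total == 0:
        raise ValueError("cannot normalise a spectrum with zero total intensity")
    return [value / total for value in spectrum]


# A made-up target-replaced spectrum with total integrated intensity 10.
target_replaced = [1.0, 2.0, 2.0, 2.0, 3.0]
normalised = normalise(target_replaced)
```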

Once the spectra have been processed as described above, they may be subjected 
to further analysis as appropriate for the particular type of spectrum. A variety of
known analysis methods may be employed, including, for example, those described
in Press et al., 1983.

For example, for NMR spectra, conventional pattern recognition methods such as
principal component analysis (PCA) may be applied. For example, it may be
desirable to perform PCA using target-replaced spectra, or, more preferably,
normalised target-replaced spectra. Similarly, it may or may not be desirable to
treat the target-replaced regions as "missing data."

Implementation 





The methods of the present invention may be conveniently performed
electronically, for example, using a suitably programmed computer system. 
Thus, one aspect of the present invention pertains to a computer system or device,
such as a computer or linked computers, operatively configured to implement the
methods of the present invention.

Another aspect of the present invention pertains to computer code suitable for
implementing the methods of the present invention on a suitable computer system. 

In one embodiment, the present invention pertains to a computer program 
comprising computer program means adapted to perform a method according to 
the present invention when the program is run on a computer.

Another aspect of the present invention pertains to a data carrier which carries 
computer code suitable for implementing the methods of the present invention on a 
suitable computer. 

In one embodiment, the present invention pertains to a computer program, as 
described above, embodied on a computer readable medium. 

Examples of data carriers and computer readable media include chip media (e.g.,
ROM, RAM, flash memory (e.g., Memory Stick™, Compact Flash™,
Smartmedia™)), magnetic disk media (e.g., floppy disks, hard drives), optical disk
media (e.g., compact disks (CDs), digital versatile disks (DVDs), magneto-optical
disks), and magnetic tape media.

Processing of NMR Spectra

Following data acquisition and initial pre-processing, but preceding the application 
of subsequent analysis (e.g., pattern recognition), the data is subjected to 
additional pre-processing, including a step of "spectral replacement" as described 
herein.

NMR spectra are typically acquired, and subsequently handled, in digitised form.
Conventional methods of spectral pre-processing of (digital) spectra are well
known, and include, where applicable, signal averaging, Fourier transformation
(and other transformation methods), phase correction, baseline correction,
smoothing, and the like (see, for example, Lindon et al., 1980).

Modern spectroscopic methods often permit the collection of high or very high
resolution spectra. In digital form, even a simple spectrum (e.g., signal intensity
versus some function of energy or frequency) may have many thousands, if not
tens of thousands, of data points. It is often desirable to reduce or compress the
data to give fewer data points, both for practical computing reasons and to
effect some degree of signal averaging to compensate for physical effects, such as
pH variation, compartmentalisation, and the like.

For example, a typical NMR spectrum is recorded as signal intensity versus
frequency. NMR signals from nuclei have a characteristic position on this axis
called a chemical shift. This is the frequency of observation relative to that of a
reference signal. When this is divided by the observation frequency, the chemical
shift is dimensionless, is given in parts per million (ppm), and is denoted by the
symbol δ. For brevity this axis will be termed the chemical shift axis. For NMR
spectra, this ranges from about δ 0 to δ 10. At a typical frequency resolution of
about 10^-3 to 10^-4 ppm, the spectrum in digital form comprises about 10,000 to
100,000 data points (typically 2 to the power 16, or 64k, or 65536).

As discussed above, it is often desirable to compress this data, for example, by a
factor of about 10 to 100, to about 1000 descriptors.

For example, in one approach, the chemical shift axis, δ, is "segmented" into
"buckets" or "bins" of a specific length. For a 1-D NMR spectrum which spans
the range from δ 0 to δ 10, using a bucket length, Δδ, of 0.04 yields 250 buckets,
for example, δ 10.0-9.96, δ 9.96-9.92, δ 9.92-9.88, etc. The signal intensity within
a given bucket may be averaged or integrated, and the resulting value reported. In
this way, a spectrum with, for example, 100,000 original data points can be
compressed to an equivalent representation with, for example, 250 data points.
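
The bucketing approach may be illustrated by the following Python sketch, which integrates (sums) the intensities falling within each bucket. The function name bucket_spectrum, its parameters, and the four data points are hypothetical. Buckets are indexed here in order of increasing chemical shift, whereas the example above lists them from δ 10.0 downwards; that ordering is a choice made for the sketch.

```python
def bucket_spectrum(shifts, intensities, lo=0.0, hi=10.0, width=0.04):
    """Sum intensities into fixed-width chemical shift buckets.

    shifts, intensities: parallel lists giving the chemical shift (ppm)
    and signal intensity of each original data point.
    Returns one integrated value per bucket, in ascending shift order.
    """
    n_buckets = round((hi - lo) / width)
    buckets = [0.0] * n_buckets
    for delta, value in zip(shifts, intensities):
        if lo <= delta <= hi:
            # A point lying exactly at the upper limit joins the last bucket.
            k = min(int((delta - lo) / width), n_buckets - 1)
            buckets[k] += value
    return buckets


# Four made-up data points; the first two share the lowest bucket.
shifts = [0.01, 0.02, 0.05, 9.99]
intensities = [1.0, 2.0, 3.0, 4.0]

buckets = bucket_spectrum(shifts, intensities)
```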
A similar approach can be applied to 2-D spectra, 3-D spectra, and the like. For
2-D spectra, the "bucket" approach may be extended to a "patch." For 3-D spectra,
the "bucket" approach may be extended to a "volume." For example, a 2-D
NMR spectrum which spans the range from δ 0 to δ 10 on both axes, using a patch
of Δδ 0.1 × Δδ 0.1 yields 10,000 patches. In this way, a spectrum with perhaps 10^8
original data points can be compressed to an equivalent spectrum of 10^4 data
points.

Software for such processing of NMR spectra, for example AMIX (Analysis of 
Mixture, V 2.5, Bruker Analytik, Rheinstetten, Germany) is commercially available.

Often, certain spectral regions carry no real diagnostic information, or carry
conflicting biochemical information, and it is often useful to remove these
"redundant" regions before performing detailed analysis. In the simplest approach,
the data points are deleted. In another simple approach, the data in the redundant
regions are replaced with zero values.

For example, due to the dynamic range problem with water in comparison with
other molecules, the water resonance (around δ 4.7) is suppressed. However,
small variations in water suppression remain, and these variations can undesirably
complicate analysis. Similarly, variations in water suppression may also affect the
urea signal (around δ 5.5), by cross saturation. Therefore, it is often useful to
delete certain spectral regions, for example, from about δ 4.5 to δ 6.0
(e.g., δ 4.52 to δ 6.00).
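
The two simple approaches to redundant regions (deletion and zero filling) may be sketched in Python as follows. The function name exclude_region, the mode argument, and the bucket values are assumptions made for illustration; each bucket is identified here by a hypothetical centre chemical shift.

```python
def exclude_region(centres, intensities, lo, hi, mode="delete"):
    """Delete or zero-fill buckets whose centre lies within [lo, hi].

    centres, intensities: parallel lists (bucket centre in ppm, intensity).
    mode: "delete" drops the buckets; "zero" keeps them with value 0.0.
    """
    if mode not in ("delete", "zero"):
        raise ValueError("mode must be 'delete' or 'zero'")
    kept_centres, kept_values = [], []
    for centre, value in zip(centres, intensities):
        inside = lo <= centre <= hi
        if inside and mode == "delete":
            continue
        kept_centres.append(centre)
        kept_values.append(0.0 if inside else value)
    return kept_centres, kept_values


# Made-up buckets around the water (4.7) and urea (5.5) resonances.
centres = [4.0, 4.7, 5.5, 6.5]
intensities = [1.0, 9.0, 3.0, 2.0]

deleted = exclude_region(centres, intensities, 4.5, 6.0, mode="delete")
zeroed = exclude_region(centres, intensities, 4.5, 6.0, mode="zero")
```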

Certain metabolites exhibit a strong degree of physiological variation (e.g., diurnal 
variation, dietary-related variation) that is unrelated to any pathophysiological
process. Such variation may undesirably complicate analysis, and mask more 
relevant details. Therefore, it may be useful to delete the spectral regions 
associated with such compounds. However, it is often possible to isolate these 
effects in later (e.g., pattern recognition) analysis. 
Xenobiotics (e.g., drugs) and their metabolites often give rise to large signals which
do not directly correlate to the conditions (e.g., pathologies) which are induced by
the xenobiotic. Therefore, it is often useful to delete the spectral regions
associated with such compounds.

In general, NMR data is handled as a data matrix. Typically, each row in the matrix 
corresponds to an individual sample (often referred to as a "data vector"), and the 
entries in the columns are, for example, spectral intensity of a particular data point, 
at a particular δ or Δδ (often referred to as "descriptors").

It is often useful to pre-process data, for example, by addressing missing data, 
translation, scaling, and weighting. 

If at all possible, missing data, for example, gaps in column values, should be
avoided. However, if necessary, such missing data may be replaced or "filled" with, for
example, the mean value of a column ("mean fill"); a random value ("random fill");
or a value based on a principal component analysis ("principal component fill").
Each of these different approaches will have a different effect on subsequent
pattern recognition (PR) analysis.
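
As an illustration of the "mean fill" approach only, the following Python sketch replaces each missing entry with the mean of the values present in the same column. The function name mean_fill and the use of None to mark a missing value are assumptions made for the sketch.

```python
def mean_fill(matrix):
    """Replace missing entries (None) with the mean of their column.

    matrix: list of rows (one per sample), each a list of descriptors.
    Returns a new matrix; the input is left unchanged.
    """
    n_cols = len(matrix[0])
    filled = [list(row) for row in matrix]
    for j in range(n_cols):
        present = [row[j] for row in matrix if row[j] is not None]
        if not present:
            raise ValueError("column %d has no values to average" % j)
        col_mean = sum(present) / len(present)
        for row in filled:
            if row[j] is None:
                row[j] = col_mean
    return filled


# Three made-up samples, two descriptors, two gaps.
data = [[1.0, None],
        [3.0, 4.0],
        [None, 6.0]]

filled = mean_fill(data)
```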

"Translation" of the descriptor coordinate axes can be useful. Examples of such
translation include normalisation and mean centring. 

"Normalisation" may be used to remove sample-to-sample variation. Many 
normalisation approaches are possible, and they can often be applied at any of
several points in the analysis. Usually, normalisation is applied after redundant
spectral regions have been removed. In one approach, each spectrum is
normalised (scaled) by a factor of 1/A, where A is the sum of the absolute values of
all of the descriptors for that spectrum. In this way, each data vector has the same
length, specifically, 1. For example, if the sum of the absolute values of intensities
for each bucket in a particular spectrum is 1067, then the intensity for each bucket
for this particular spectrum is scaled by 1/1067.
"Mean centring" may be used to simplify interpretation. Usually, for each
descriptor, the average value of that descriptor for all samples is subtracted. In this
way, the mean of a descriptor coincides with the origin, and all descriptors are
"centred" at zero. For example, if the average intensity at δ 10.0-9.96, for all
spectra, is 1.2 units, then the intensity at δ 10.0-9.96, for all spectra, is reduced by
1.2 units.
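
Mean centring of a data matrix may be sketched in Python as follows. The function name mean_centre and the two-sample matrix are illustrative assumptions; the first column is chosen so that its mean is 1.2 units, echoing the example above.

```python
def mean_centre(matrix):
    """Subtract each descriptor's mean (taken over all samples).

    matrix: list of rows (samples), each a list of descriptors.
    Returns (centred matrix, list of column means).
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centred = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    return centred, means


# Two made-up spectra with two descriptors each.
data = [[1.0, 10.0],
        [1.4, 30.0]]

centred, means = mean_centre(data)
column_sums = [sum(row[j] for row in centred) for j in range(2)]
```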

In "unit variance (UV) scaling," data can be scaled to equal variance. Usually, the 
value of each descriptor is scaled by 1/StDev, where StDev is the standard
deviation for that descriptor for all samples. For example, if the standard deviation
at δ 10.0-9.96, for all spectra, is 2.5 units, then the intensity at δ 10.0-9.96, for all
spectra, is scaled by 1/2.5 or 0.4. Unit variance scaling may be used to reduce the
impact of "noisy" data. For example, some metabolites in biofluids show a strong
degree of physiological variation (e.g., diurnal variation, dietary-related variation)
that is unrelated to any pathophysiological process. Without unit variance scaling,
these noisy metabolites may dominate subsequent analysis.
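
Unit variance scaling may be sketched in Python as follows. The function name uv_scale and the data values are hypothetical, and the sketch assumes the sample standard deviation (divisor n - 1, as computed by statistics.stdev); the text above does not state which form of the standard deviation is intended. The first column is chosen to have a standard deviation of 2.5 units, echoing the example above.

```python
import statistics


def uv_scale(matrix):
    """Divide each descriptor by its standard deviation over all samples.

    Assumes the sample standard deviation (n - 1 divisor).
    Returns (scaled matrix, list of column standard deviations).
    """
    columns = list(zip(*matrix))
    stdevs = [statistics.stdev(column) for column in columns]
    if any(s == 0 for s in stdevs):
        raise ValueError("a descriptor with zero variance cannot be scaled")
    scaled = [[value / s for value, s in zip(row, stdevs)] for row in matrix]
    return scaled, stdevs


# Three made-up spectra with two descriptors each.
data = [[0.0, 1.0],
        [2.5, 2.0],
        [5.0, 3.0]]

scaled, stdevs = uv_scale(data)
scaled_stdevs = [statistics.stdev(column) for column in zip(*scaled)]
```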

"Logarithmic scaling" may be used to assist interpretation when data have a 
positive skew and/or when data spans a large range, e.g., several orders of
magnitude. Usually, for each descriptor, the value is replaced by the logarithm of
that value. For example, the intensity at δ 10.0-9.96 is replaced by the logarithm of
the intensity at δ 10.0-9.96, for all spectra.

In "equal range scaling," each descriptor is divided by the range of that descriptor 
for all samples. In this way, all descriptors have the same range, that is, 1. For
example, if, at δ 10.0-9.96, for all spectra, the largest value is 87 units and the
smallest value is 1, then the range is 86 units, and the intensity at δ 10.0-9.96, for
all spectra, is divided by 86 units. However, this method is sensitive to the presence of
outlier points.



In "autoscaling," each data vector is mean centred and unit variance scaled. This 
technique is very useful because each descriptor is then weighted equally and, in
the case of NMR descriptors, large and small peaks are treated with equal
emphasis. This can be important for metabolites present at very low levels but still
NMR-detectable.

Several supervised methods of scaling data are also known. Some of these can
provide a measure of the ability of a parameter (e.g., a descriptor) to discriminate
between classes, and can be used to improve classification by stretching a
separation.

For example, in "variance weighting," the variance weight of a single parameter
(e.g., a descriptor) is calculated as the ratio of the inter-class variances to the sum
of the intra-class variances. A large value means that this variable is discriminating
between the classes. For example, if the samples are known to fall into two
classes (e.g., a training set), it is possible to examine the mean and variance of
each descriptor. If a descriptor has very different mean values and a small
variance, then it will be good at separating the classes.
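
One possible reading of the variance weight for a two-class training set is sketched below in Python. The text above does not give an explicit formula, so the sketch assumes that the inter-class variance is the sample variance of the two class means and that the intra-class term is the sum of the two within-class sample variances. The function name variance_weight and the data values are hypothetical.

```python
import statistics


def variance_weight(class_a, class_b):
    """Variance weight of one descriptor for a two-class training set.

    class_a, class_b: values of the descriptor for the samples of each class.
    Assumed definition: variance of the class means divided by the sum of
    the within-class variances (sample variances throughout).
    """
    mean_a = statistics.mean(class_a)
    mean_b = statistics.mean(class_b)
    inter = statistics.variance([mean_a, mean_b])
    intra = statistics.variance(class_a) + statistics.variance(class_b)
    return inter / intra


# A descriptor with well-separated class means and little spread ...
good = variance_weight([1.0, 1.1, 0.9], [5.0, 5.1, 4.9])
# ... and one whose classes overlap heavily.
poor = variance_weight([1.0, 3.0, 5.0], [2.0, 4.0, 6.0])
```

Under these assumptions, the well-separated descriptor receives a much larger weight than the overlapping one, which is the behaviour described in the text.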

"Feature weighting" is a more general description of variance weighting, where not 
only the mean and standard deviation of each descriptor are calculated, but other
well known weighting factors, such as the Fisher weight, are used.

Spurious or irregular data ("outliers"), which are not representative, are preferably 
identified and removed. Common reasons for irregular data ("outliers") include
poor phase correction, poor baseline correction, poor chemical shift referencing,
poor water suppression, bacterial contamination, shifts in the pH of the biofluid,
toxin- or disease-induced biochemical response, and idiosyncratic response to 
xenobiotics. 

Outliers are identified in different ways depending on the method of analysis used.
For example, when using principal component analysis (PCA), small numbers of
samples lying far from the rest of the replicate group can be identified by eye as
outliers. A more objective means of identification for PCA is to use Hotelling's T²
test, which is the multivariate version of the well known Student's t-test used in
univariate statistics. For any given sample, the T² value can be calculated and this
is compared with a standard value within which a chosen fraction (e.g., 95%) of the
samples would normally lie. Samples with T² values substantially outside this limit
can then be flagged as outliers. Also, when using more sophisticated supervised
methods, such as SIMCA or PNNs, a similar method is used. A confidence level
(e.g., 95%) is selected and the region of multivariate space corresponding to
confidence values above this limit is determined. This region can be displayed
graphically in several different ways (for example, by plotting the critical T² ellipse
on a PCA scores plot). Any samples falling outside the high confidence region are
flagged as potential outliers. Naturally, such samples are investigated in detail to
determine the causes of their outlying nature before removing them from the model.

Applications 

As discussed above, the methods of the present invention may be used in the
analysis of chemical, biochemical, and biological data.

Metabonomic methods, in conjunction with the methods of the present invention, 
provide powerful means for the diagnosis, prognosis, and treatment of disease, for 
understanding the benefits and side-effects of xenobiotic compounds thereby aiding 
the drug development process, as well as for improving therapeutic regimes for
current drugs. 

For example, applications of metabonomic methods, in conjunction with the
methods of the present invention, include, but are not limited to, early detection of
abnormality/problem; differential diagnosis (classification of disease); prognosis
(prediction of future outcome); therapeutic monitoring; identifying, classifying,
determining the progress of, and monitoring the treatment of, infectious diseases;
clinical evaluations of drug therapy and efficacy; detection of toxic side-effects of
drugs and model compounds (e.g., in the drug development process and in clinical
trials); investigation of idiosyncratic toxicity; characterization and identification of
drugs used in overdose; classification, fingerprinting, and diagnosis of metabolic
diseases (e.g., inborn errors of metabolism); improvement in the quality control of
transgenic animal models of disease; aiding the design of transgenic models of
disease; and searching for new biochemical markers of disease and/or tissue or
organ damage.



Metabonomic methods, in conjunction with the methods of the present invention,
may be used as an alternative or adjunct to the various genomic,
pharmacogenomic, and proteomic methods, including those described above.

Metabonomic methods, in conjunction with the methods of the present invention,
may also be used to identify (known or novel) genotypes and/or phenotypes, and to
determine an organism's phenotype or genotype. This may assist with the choice
of a suitable treatment or allow assessment of its relevance in a drug development
process. For example, the generation of metabonomic data in panels of individuals
with disease states, infected states, or undergoing treatment may indicate response
profiles of groups of individuals which can be differentiated into two or more
subgroups, indicating that an allelic genetic basis for response to the disease,
state, or treatment exists. For example, a particular phenotype may not be
susceptible to treatment with a certain drug, while another phenotype may be
susceptible to treatment. Conversely, one phenotype might show toxicity because
of a failure to metabolise and hence excrete a drug, which drug might be safe in
another phenotype as it does not exhibit this effect. For example, metabonomic
methods may be used to determine the acetylator status of an organism: there are
two phenotypes, corresponding to "fast" and "slow" acetylation of drug metabolites.
Phenotyping may be achieved on the basis of the urine alone (i.e., without dosing
a xenobiotic), or on the basis of urine following dosing with a xenobiotic which has
the potential for acetylation (e.g., galactosamine). Similar methods may also be
used to determine other differences, such as other enzymatic polymorphisms, for
example, cytochrome P450 polymorphism.

Metabonomic methods, in conjunction with the methods of the present invention,
may also be used in studies of the biochemical consequences of genetic
modification, for example, in "knock-out" animals where one or more genes have
been removed or made non-functional; in "knock-in" animals where one or more
genes have been incorporated from the same or a different species; and in animals
where the number of copies of a gene has been increased (as in the model which
results in the over-expression of the beta amyloid protein in mice brains as a model
for Alzheimer's disease). Genes can be transferred between bacterial, plant and
animal species.

The combination of genomic, proteomic, and metabonomic data sets into 
comprehensive "bionomic" systems may permit an holistic evaluation of perturbed 
in vivo function. 

The methods of the present invention are also useful in other applications, including
investigations into the effects of environmental pollutants (e.g., wastewater
analysis, animal population studies, studies of invertebrates, marine organisms),
and the effects of xenobiotic stimuli and genetic changes in plants.

EXAMPLES

The following examples are provided solely to illustrate the present invention and 
are not intended to limit the scope of the invention, as described herein. 

The methods of the present invention have been exemplified in their application to
NMR spectra. Nonetheless, the methods of the present invention are similarly 
applicable to other types of spectra, such as those discussed above. 

A spectral data set consisting of 75 spectra was simulated, representing spectra
taken at five time points (T1, T2, T3, T4, and T5) for three groups of five animals
(A1-A5, A6-A10, and A11-A15). The first group of animals (A1-A5) were control
animals. The second group of animals (A6-A10) were dosed animals. The third
group of animals (A11-A15) were also dosed animals, but differently so (for example,
with a different drug/toxin, or a different amount of the same drug/toxin).

The data set was generated using a PARAFAC model (see, for example, Bro,
1997). In this model, the generated spectra were linear combinations of the four
base spectra (denoted A, B, C, and D) shown in Figure 1, where chemical shift
(represented by spectral bin number) is along the x-axis, and spectral intensity is
along the y-axis. The contribution of each base spectrum is determined by two
corresponding factors, the animal factor and the time factor, discussed below.

The animal factors (denoted AF_A, AF_B, AF_C, and AF_D) are shown in Figure 2, where
the animal number (A1-A15) is along the x-axis, and the animal factor is along the
y-axis. Thus, for each base spectrum and animal, there is an animal factor, e.g.,
AF_B-A7 for base spectrum B and animal 7.



The time factors (denoted TF_A, TF_B, TF_C, and TF_D) are shown in Figure 3, where
the time point (T1-T5) is along the x-axis and the time factor is along the y-axis.
Thus, for each base spectrum and time point, there is a time factor, e.g., TF_B-T3 for
base spectrum B and time point 3.

For example, the spectrum for animal number 7 (A7) at time point 3 (T3) is a linear
combination of the four base spectra (A, B, C, D), with coefficients (AF_A-A7*TF_A-T3),
(AF_B-A7*TF_B-T3), (AF_C-A7*TF_C-T3), and (AF_D-A7*TF_D-T3), respectively.
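
The construction of a simulated spectrum as a linear combination of base spectra may be sketched in Python as follows. The function name simulate_spectrum, the four-bin base spectra, and the factor values are made up for illustration and are not the values plotted in Figures 1 to 3.

```python
def simulate_spectrum(base_spectra, animal_factors, time_factors):
    """Combine base spectra with coefficients AF * TF.

    base_spectra: dict mapping a base spectrum name to its list of bins.
    animal_factors, time_factors: dicts mapping the same names to the
    factor for one animal and one time point, respectively.
    """
    n_bins = len(next(iter(base_spectra.values())))
    spectrum = [0.0] * n_bins
    for name, base in base_spectra.items():
        coefficient = animal_factors[name] * time_factors[name]
        for j, value in enumerate(base):
            spectrum[j] += coefficient * value
    return spectrum


# Made-up base spectra, each with a single peak in a different bin.
base_spectra = {"A": [2.0, 0.0, 0.0, 0.0],
                "B": [0.0, 4.0, 0.0, 0.0],
                "C": [0.0, 0.0, 8.0, 0.0],
                "D": [0.0, 0.0, 0.0, 1.0]}
animal_factors = {"A": 0.5, "B": 1.0, "C": 1.0, "D": 0.0}   # one animal
time_factors = {"A": 0.5, "B": 0.5, "C": 0.25, "D": 1.0}    # one time point

spectrum = simulate_spectrum(base_spectra, animal_factors, time_factors)
```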

For example, spectra for animal number 6 (A6) at the five time points (T1-T5) are
shown in Figure 4, as curves (i), (ii), (iii), (iv), and (v), respectively. Spectrum (i) is
for A6-T1; spectrum (ii) is for A6-T2; spectrum (iii) is for A6-T3; spectrum (iv) is for
A6-T4; and spectrum (v) is for A6-T5. Those peaks marked (X) will be the subject of
spectral replacement (see below). The peaks marked (Y) are the endogenous
metabolites associated with the animals' response to the applied stimulus.

For the control animals (A1-A5), the animal factors AF_B, AF_C, and AF_D are all very
small (less than about 0.05) while the animal factor AF_A is large and approximately
constant (about 0.5). Therefore, the spectra for the control animals are dominated by
base spectrum A. Also, the time factor for base spectrum A, TF_A, is approximately
constant for all time points (about 0.45). Therefore, qualitatively (and as expected
in a real control group), all spectra for the control animals are very similar. The
master control spectrum, in this case, the mean of all 25 control spectra (5 animals,
5 time points), is shown in Figure 4.
For the second group of animals (A6-A10), the animal factor AF_D is very small (less
than about 0.05) while the animal factor AF_A is about 0.5, and the animal factors
AF_B and AF_C are about 1.0. Therefore, the spectra for the second group of animals
are dominated by the base spectra A, B and C. Also, the time factor for base
spectrum A, TF_A, is approximately constant for all time points (about 0.45), while
the time factor for the base spectrum B, TF_B, varies from about 0.1 to about 0.65,
and peaks at time point 3, and the time factor for the base spectrum C, TF_C, varies
from about 0.1 to about 0.55, and peaks at time point 4. Therefore, qualitatively,
the spectra for the second group of animals will resemble the base spectrum A, but
with varying amounts of base spectra B and C superposed thereupon.

For the third group of animals (A11-A15), the animal factors AF_B and AF_C are very
small (less than about 0.05) while the animal factor AF_A is about 0.5, and the
animal factor AF_D is about 1.0. Therefore, the spectra for the third group of animals
are dominated by the base spectra A and D. Also, the time factor for base spectrum
A, TF_A, is approximately constant for all time points (about 0.45), while the time
factor for the base spectrum D, TF_D, varies from about 0.1 to about 0.75, and peaks
at time point 4. Therefore, qualitatively, the spectra for the third group of animals
will resemble the base spectrum A, but with varying amounts of base spectrum D
superposed thereupon.

As discussed above, base spectrum A qualitatively represents the spectrum for
control animals (although it is also present in the spectra for dosed animals). For
the purposes of this example, base spectrum B qualitatively represents a
metabolite or metabolites of the administered drug/toxin (i.e., an interfering signal),
while base spectra C and D qualitatively represent different biomarkers or
biomarker combinations of the animals' response to the two different drug/toxin
regimes.

Using a conventional analysis, the spectral regions associated with the interfering
signal (i.e., in base spectrum B) were identified as target regions (in this example,
spectral bin numbers 15-26 inclusive and 47-58 inclusive), and the data in these
spectral regions were deleted from all spectra. The resulting "deleted" spectra were
re-normalised and then analysed by principal component analysis.

The resulting scores plot (PC2 versus PC1) is shown in Figure 7. Two groups of
data points were clearly separated (from the control population), specifically
A6-A10,T2 and A6-A10,T3. Two groups of data points were partially separated,
specifically A6-A10,T1 and A6-A10,T4. Several groups of data points were not
separated, specifically A6-A10,T5 and A11-A15,T1-5.

Using the methods of the present invention, the target regions, that is, the spectral
regions associated with the interfering signal (i.e., in base spectrum B) were
identified (in this example, spectral bin numbers 15-26 inclusive and 47-58
inclusive). A master control spectrum was calculated as the mean of the 25 control
animal spectra (the master control spectrum is shown in Figure 4). The target
regions in all spectra for animals 6-10 were then replaced with corresponding
scaled replacement regions from the master control spectrum. The resulting target-
replaced spectra were then re-normalised to give normalised target-replaced spectra.

Two examples of this process are shown in Figure 5 (for the spectrum for animal
number 6 at time point 2, A6,T2) and Figure 6 (for the spectrum for animal number 6
at time point 3, A6,T3). In each case: spectrum (i) is the original spectrum, before
replacement; spectrum (ii) is the spectrum after spectral replacement; spectrum (iii)
is spectrum (ii) after re-normalisation; the first target region (T-I) was spectral bin
numbers 15-26 inclusive, and the second target region (T-II) was spectral bin
numbers 47-58 inclusive, as indicated by the vertical dotted lines. The numerical
parameters are summarised in the table below.
Table 1

Parameters for Spectral Replacement

Parameter         i = A6,T2
I_Yi              3.75         2.56
Σ I_Yi,Tk         1.77         1.55
I_CM              0.56         0.55
Σ I_CM,Rk         0.24         0.24
f_i               6.33         3.23
N_i               1.82         1.82

The resulting normalised target-replaced spectra were then analysed by principal
component analysis. In one analysis, the replaced regions were treated as
"missing data" (a conventional method in PCA) and the resulting scores
plot (PC2 versus PC1) is shown in Figure 8. Nine groups of data points were
clearly separated (from the control population), specifically A6-A10,T1-5 and
A11-A15,T1-4. One group of data points was not separated, specifically A11-A15,T5.
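
The text does not say which algorithm was used to handle the "missing data". One common choice is NIPALS, in which every sum simply skips the missing cells. The sketch below extracts a first principal component in that way. It assumes the data matrix is already mean-centred, marks missing cells with `None`, and uses an arbitrary starting guess. The function name `nipals_pc1` is hypothetical.

```python
import math


def nipals_pc1(X, max_iter=500, tol=1e-10):
    """First principal component of mean-centred X; None marks a missing cell.

    Returns (scores, loadings). All sums run over observed cells only.
    """
    n, p = len(X), len(X[0])
    # Assumed starting guess: first column, with missing cells set to zero.
    t = [row[0] if row[0] is not None else 0.0 for row in X]
    load = [0.0] * p
    for _ in range(max_iter):
        # Loadings: regress each column on the current scores.
        load = []
        for j in range(p):
            obs = [i for i in range(n) if X[i][j] is not None]
            den = sum(t[i] ** 2 for i in obs)
            num = sum(t[i] * X[i][j] for i in obs)
            load.append(num / den if den else 0.0)
        norm = math.sqrt(sum(v * v for v in load))
        load = [v / norm for v in load]
        # Scores: regress each row on the normalised loadings.
        t_new = []
        for i in range(n):
            obs = [j for j in range(p) if X[i][j] is not None]
            den = sum(load[j] ** 2 for j in obs)
            num = sum(X[i][j] * load[j] for j in obs)
            t_new.append(num / den if den else 0.0)
        change = max(abs(a - b) for a, b in zip(t, t_new))
        t = t_new
        if change < tol:
            break
    return t, load
```

In the setting of this example, each row would be one normalised spectrum and the cells in the replaced bins would be set to `None`. Later components can be obtained by subtracting the outer product of scores and loadings from the observed cells and repeating.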

In another analysis, the replaced regions were not treated as "missing data" and
the resulting scores plot (PC2 versus PC1) is shown in Figure 9. Eight groups of
data points were clearly separated (from the control population), specifically
A6-A10,T1-4 and A11-A15,T1-4. One group of data points was partially separated,
specifically A6-A10,T5. One group of data points was not separated, specifically
A11-A15,T5.

The clear separation of the A11-A15 data in Figures 8 and 9 (specifically,
A11-A15,T1-4), as compared to the lack of their separation in Figure 7, demonstrates
the effectiveness of the methods of the present invention, specifically, in retrieving
information that would otherwise have been lost or missed.

CLAIMS

1. A method for processing a sample spectrum comprising:

replacing each of one or more target regions in said sample spectrum
with a corresponding replacement region of a master control spectrum to
give a target-replaced sample spectrum,

wherein said replacement region has been scaled so as to have the
same fraction of the total integrated intensity in said target-replaced sample
spectrum as it did in said master control spectrum.

2. A method for processing a sample spectrum for a test sample, said method
comprising the steps of:

(a) identifying, in said sample spectrum, one or more target regions
for replacement;

(b) providing a master control spectrum which comprises one
replacement region corresponding to each of said target regions; and,

(c) replacing each of said target regions with the corresponding
replacement region to give a target-replaced sample spectrum,

wherein said replacement region has been scaled so as to have the
same fraction of the total integrated intensity in said target-replaced sample
spectrum as it did in said master control spectrum.

3. A method according to claim 2, further comprising the subsequent step of:

(d) normalising said target-replaced sample spectrum to give a
normalised target-replaced sample spectrum.

4. A method for processing a sample NMR spectrum for a test sample, said
method comprising the steps of:

(a) identifying, in said sample NMR spectrum, one or more target
regions for replacement, wherein each of said target regions is defined by a
chemical shift range;

(b) providing a master control NMR spectrum which comprises one
replacement region corresponding to each of said target regions, wherein a
target region and its corresponding replacement region are defined by the
same chemical shift range; and,

(c) replacing each of said target regions with the corresponding
replacement region to give a target-replaced sample NMR spectrum,

wherein said replacement region has been scaled so as to have the
same fraction of the total integrated intensity in said target-replaced sample
NMR spectrum as it did in said master control NMR spectrum.

5. A method according to claim 4, further comprising the subsequent step of: 

(d) normalising said target-replaced sample NMR spectrum to give a 
normalised target-replaced sample NMR spectrum. 

6. A method according to any one of claims 2 to 5, wherein, 

in said replacing step (c), each of said target regions is replaced with 
the corresponding replacement region to give a target-replaced sample 
spectrum, 

wherein said replacement region has been scaled by a factor, f, given
by the formula:

\[
f = \frac{I_Y - \sum_{k=1}^{n_t} I_{Y,T_k}}{I_{CM} - \sum_{k=1}^{n_t} I_{CM,R_k}}
\]

wherein:

I_Y is the total integrated intensity of the sample spectrum;
I_{Y,T_k} is the integrated intensity of the target region;
I_{CM} is the total integrated intensity of the master control spectrum;
I_{CM,R_k} is the integrated intensity of the replacement region;
k ranges from 1 to n_t; and,
n_t is the number of target regions.
7. A sample spectrum which has been processed by a method according to
any one of claims 1 to 6. 



8. A method for processing a plurality of sample spectra, comprising
processing each of said sample spectra by a method according to any one
of claims 1 to 6.

9. A method of analysis of an applied stimulus, comprising the steps of:

(a) providing one or more sample spectra for each of one or more
samples from each of one or more organisms which have been subjected to
said applied stimulus;

(b) providing a master control spectrum derived from one or more
control spectra for each of one or more samples from each of one or more
organisms which have not been subjected to said applied stimulus;

(c) processing each of said sample spectra using a method according
to any one of claims 1 to 6.

10. A method according to claim 9, wherein said applied stimulus is a xenobiotic.

11. A method according to claim 9, wherein said applied stimulus is a disease
state.

12. A method according to claim 9, wherein said applied stimulus is a genetic
modification.

13. A method for identifying a biomarker or biomarker combination for an applied
stimulus, comprising a method of analysis according to any one of claims 9
to 12.

14. A biomarker or biomarker combination identified by a method according to
claim 13.
15. A method of diagnosis of an applied stimulus employing a biomarker
identified by a method according to claim 13.

16. An assay which employs a biomarker identified by a method according to
claim 13.

17. A method of classifying an applied stimulus, comprising a method of analysis 
according to any one of claims 9 to 12. 

18. A method of diagnosis of an applied stimulus, comprising a method of
analysis according to any one of claims 9 to 12.

19. A method of therapeutic monitoring of a subject undergoing therapy,
comprising a method of analysis according to any one of claims 9 to 12.

20. A method of evaluating drug therapy and/or drug efficacy, comprising a
method of analysis according to any one of claims 9 to 12.

21. A method of detecting toxic side-effects of drug, comprising a method of
analysis according to any one of claims 9 to 12.

22. A method of characterising and/or identifying a drug in overdose, comprising
a method of analysis according to any one of claims 9 to 12.

23. A method according to any one of claims 8 to 22, wherein said spectrum or
spectra is an NMR spectrum or NMR spectra.

24. A computer system operatively configured to implement a method according 
to any one of claims 1 to 23. 

25. Computer code suitable for implementing a method according to any one of 
claims 1 to 23 on a suitable computer system. 
26. A data carrier which carries computer code suitable for implementing a
method according to any one of claims 1 to 23 on a suitable computer 
system. 
